Clustering
Clustering Is a set of data driven partitioning techniques designed to group a collection of objects into clusters.
⚠️ Data should ALWAYS be continuous and standardized in nature.
✴ Clustering is finding borders between groups
✴ Segmentation is using borders to form groups
Applications :
- Market Segmentation
- Sales segmentation : what type of customer wants what
- Credit risk
- Operations : High performing persons and promotions
- Insurance : identifying groups with high average claim cost
- Data reduction : grouping observations to reduce number is obs
How to build clusters :
- Select distance measure
- Select clustering algorithm
- Define the distance between 2 clusters
- Determine no of clusters
- Validate the analysis
Methods
- Linkage method
- Variance method
- Centroid method
Closeness of two clusters :
The decision of merging two clusters is taken on the basis of closeness of these clusters. There are multiple metrics for deciding:
- Euclidean distance: (a-b)2 = √(Σ(ai-bi))
- Squared Euclidean distance: ((a-b)2)2 = Σ((ai-bi)2)
- Manhattan distance: (a-b)1 = Σ(ai-bi)
- Maximum distance: (a-b)INFINITY = maxi(ai-bi)
- Mahalanobis distance: √((a-b)T S-1 (-b)) {where, s : covariance matrix}