Clustering K Means and KNEE Bend Technique
Customer clustering with KMeans and the knee-bend technique in Python
About the dataset:
Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unsatisfied customer needs. Companies can then use these segments to outperform the competition by developing uniquely appealing products and services.
You own a supermarket mall and, through membership cards, you have some basic data about your customers, such as Customer ID, age, gender, annual income and spending score. You want to understand your customers, in particular who the target customers are, so that the marketing team can use these insights to plan its strategy accordingly.
Step 1 : Importing Required Libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
Step 2 : Reading the dataset using Pandas:
customers = pd.read_csv('/Users/maheshg/Library/customer_clustering/segmentation data.csv')
Preprocessing: Standardising removes differences in magnitude between variables, so that no single variable dominates because of its scale.
First check whether any variable dominates the others; this tells us whether preprocessing is needed.
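Besides the box plot, the dominance check can also be done numerically. The following is a minimal sketch, assuming a small made-up DataFrame (`demo` and its column names and values are hypothetical stand-ins, not the real dataset). It compares the range of each numeric column:

```python
import pandas as pd

# Hypothetical stand-in for the customer data; column names and values
# are illustrative assumptions, not taken from the original file.
demo = pd.DataFrame({
    "Age": [22, 35, 47, 58, 63, 29],
    "Income": [45000, 120000, 98000, 150000, 67000, 83000],
    "Education": [0, 1, 2, 3, 1, 2],
})

# Summarise the spread of each numeric column (one row per variable).
spread = demo.agg(["min", "max", "std"]).T
spread["range"] = spread["max"] - spread["min"]

# Ratio of the largest range to the smallest. A large ratio suggests one
# variable would dominate distance calculations if left unscaled.
range_ratio = spread["range"].max() / spread["range"].min()
dominant = spread["range"].idxmax()

print(spread)
print(f"Largest range: {dominant}, ratio to smallest range: {range_ratio:.0f}")
```

If the ratio is far from 1, as it is for this toy data, standardising before KMeans is probably worthwhile.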
Step 3: Visualize the existing variables using catplot()
sns.catplot(data=customers, kind='box', aspect=5)
plt.show()

Step 4 : Selecting the features as numbers and copy it in the local variable
customers_features = customers.select_dtypes('number').copy()
Step 5: Check the variables in the customers_features object using info()
customers_features.info()

Step 6: Standardise with StandardScaler(), which follows three steps: initialise, fit and transform
from sklearn.preprocessing import StandardScaler
Xcustomers = StandardScaler().fit_transform(customers_features)
Xcustomers.shape
(6, 8)
type(Xcustomers)
numpy.ndarray
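The one-liner `fit_transform` above combines the three steps. As a rough illustration, the sketch below spells them out separately on a small made-up matrix (`X_demo` is a hypothetical example, not the customer data) and compares the result with a manual calculation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric matrix (rows = customers, columns = features).
X_demo = np.array([[22.0, 45000.0],
                   [35.0, 120000.0],
                   [47.0, 98000.0],
                   [58.0, 150000.0]])

scaler = StandardScaler()             # 1. initialise
scaler.fit(X_demo)                    # 2. fit: learn each column's mean and std
X_scaled = scaler.transform(X_demo)   # 3. transform: (x - mean) / std

# The same result by hand; StandardScaler uses the population
# standard deviation (ddof=0), which is NumPy's default.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distances KMeans computes.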
Step 7: Visualize the variables after standardizing; the plots now show all variables on a comparable scale
sns.catplot(data=pd.DataFrame(data=Xcustomers, columns=customers_features.columns), kind='box',
aspect=2)
plt.show()

Note: In the category plot above, sex and marital status are not genuinely numeric variables. Since only the numeric columns were selected in Step 4, the clusters are identified from those columns.
Step 8: K Means Introduction
customers.head(5)

Check for NA values before defining the number of clusters
customers.isna().sum()

Step 9 : Visualize the category plot after validation of NA values
sns.catplot(data=pd.DataFrame(data=Xcustomers, columns=customers_features.columns),kind='box',aspect=2)
plt.show()

customers_cluster_2 = KMeans(n_clusters=2, random_state=210, n_init=25, max_iter=500).fit_predict(Xcustomers)
customers['k2'] = pd.Series(data=customers_cluster_2, index=customers.index).astype('category')
customers.info()

Step 10: Checking the cluster K2 value counts
customers.k2.value_counts()

Step 11: Create a pairplot coloured by the 2-cluster labels (hue = k2) and interpret the variables
sns.pairplot(data=customers, hue='k2')
plt.show()
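A numeric summary can complement the pairplot when interpreting the clusters. The sketch below is one possible approach, assuming a small made-up table with a `k2` label column (`demo` and its values are hypothetical). It computes the per-cluster mean of each variable in the original units:

```python
import pandas as pd

# Hypothetical labelled data; in the post the k2 column comes from KMeans.
demo = pd.DataFrame({
    "Age": [22, 25, 61, 58, 30, 65],
    "Income": [40000, 52000, 130000, 125000, 48000, 140000],
    "k2": pd.Series([0, 0, 1, 1, 0, 1]).astype("category"),
})

# Mean of each variable within each cluster (unscaled, so easy to read).
profile = demo.groupby("k2", observed=True)[["Age", "Income"]].mean()

# Cluster sizes, ordered by cluster label.
sizes = demo["k2"].value_counts().sort_index()

print(profile)
print(sizes)
```

Computing the means on the unscaled columns keeps the profile in units the marketing team can read directly.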

Example 2: Use 3 clusters and create a variable k3 in the same dataset
customers_cluster_3 = KMeans(n_clusters=3, random_state=210, n_init=25, max_iter=50).fit_predict(Xcustomers)
customers['k3'] = pd.Series(data=customers_cluster_3, index=customers.index).astype('category')
customers.k3.value_counts()

sns.pairplot(data=customers, hue='k3', diag_kws={'common_norm':False})
plt.show()

Step 12: Heatmap with 3 clusters
fig, ax = plt.subplots()
sns.heatmap(data=pd.crosstab(customers.Age, customers.k3),
annot=True, annot_kws={'fontsize':25},
ax=ax, fmt='g', cbar=False)
plt.show()

Step 13: Finding the Optimal Number of Clusters
tots_within = []
K = range(1,7)
for k in K:
    km = KMeans(n_clusters=k, random_state=210, n_init=25, max_iter=500)
    km = km.fit(Xcustomers)
    tots_within.append(km.inertia_)
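The `inertia_` value collected in the loop is the total within-cluster sum of squares: the squared distance from every point to its assigned cluster centre, summed over all points. A minimal sketch on synthetic data (`X_demo` is a hypothetical two-blob example, not the customer data) checks this by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

km = KMeans(n_clusters=2, random_state=210, n_init=25, max_iter=500).fit(X_demo)

# Look up the centre assigned to each point, then sum the squared distances.
assigned_centres = km.cluster_centers_[km.labels_]
manual_wss = ((X_demo - assigned_centres) ** 2).sum()

print(np.isclose(manual_wss, km.inertia_))
```

Because adding clusters can only keep this total the same or shrink it, the plot is read for the point where the decrease flattens out rather than for its minimum.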
Visualize the KNEE BEND
fig, ax = plt.subplots()
ax.plot(K, tots_within,'bo-')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Total Within Sum of Squares')
# ax.set_title('Customer Clustering based on available clusters from the dataset ')
ax.set_title('Customer Clustering based on available clusters from the dataset')
plt.show()
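The knee is usually read off the plot by eye. As a rough cross-check, one common heuristic (not necessarily the method used in this post) picks the point farthest from the straight line joining the first and last points of the curve. The sketch below assumes made-up within-cluster sums of squares; `wss` is hypothetical, not output from the customer data:

```python
import numpy as np

# Hypothetical within-cluster sums of squares for k = 1..6.
k_values = np.arange(1, 7)
wss = np.array([1200.0, 700.0, 420.0, 380.0, 350.0, 330.0])

# Normalise both axes to [0, 1] so neither axis dominates the distance.
x = (k_values - k_values.min()) / (k_values.max() - k_values.min())
y = (wss - wss.min()) / (wss.max() - wss.min())

# Unit vector along the chord from the first to the last point.
p1 = np.array([x[0], y[0]])
p2 = np.array([x[-1], y[-1]])
direction = (p2 - p1) / np.linalg.norm(p2 - p1)

# Perpendicular distance of each point from the chord
# (magnitude of the 2D cross product with the unit direction).
rel = np.column_stack([x, y]) - p1
dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])

knee_k = int(k_values[np.argmax(dist)])
print("Estimated knee at k =", knee_k)
```

This is only a heuristic: when the curve bends gradually, the estimate can be ambiguous and is best checked against the plot.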
