Clustering with K-Means and the Knee Bend Technique

Customer clustering using KMeans and the knee bend technique in Python

About the dataset:

Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means to identify unsatisfied customer needs. Using these insights, companies can outperform the competition by developing uniquely appealing products and services.

You own a supermarket mall and, through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income and spending score. You want to understand your customers, in particular who the target customers are, so that the marketing team can plan its strategy accordingly.

Step 1 : Importing Required Libraries and packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Step 2 : Reading the dataset using Pandas:

customers = pd.read_csv('/Users/maheshg/Library/customer_clustering/segmentation data.csv')

Preprocessing: Standardizing removes differences in magnitude and scale between variables, so that no single variable dominates the clustering.

First check whether any variable dominates the others in scale. This confirms whether preprocessing is needed.
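As a numeric complement to the box plot, the check can also be sketched by comparing the spread of each column. This is a minimal sketch on a hypothetical toy table (the column names and values are invented for illustration and are not taken from the dataset); the idea of comparing standard deviations is one simple heuristic, not the only way to judge dominance.

```python
import pandas as pd

# Hypothetical toy data standing in for the customer table
toy = pd.DataFrame({
    "Age": [22, 35, 47, 58, 63],
    "Income": [30000, 52000, 75000, 120000, 98000],
    "Score": [1, 0, 2, 1, 0],
})

# Standard deviation of each column: a rough measure of its scale
spread = toy.std()

# A large ratio between the largest and smallest spread suggests
# that one variable would dominate a distance-based method
ratio = spread.max() / spread.min()
dominant = spread.idxmax()

print(spread)
print(f"Largest spread: {dominant}, ratio to smallest: {ratio:.1f}")
```

When the ratio is far above 1, as it is for this toy table, distances between rows are driven almost entirely by the large-scale column, which is the situation standardization is meant to correct.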

Step 3: Visualize the existing variables using catplot()

sns.catplot(data=customers, kind='box', aspect=5)
plt.show()

Step 4: Select the numeric features and copy them into a local variable

customers_features = customers.select_dtypes('number').copy()

Step 5: Check the variables in the customers_features object using info()

customers_features.info()
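To make the effect of select_dtypes('number') concrete, here is a small sketch on a hypothetical toy frame (the columns below are invented for illustration). It shows that the selection is based on how a column is stored, not on what it means.

```python
import pandas as pd

# Hypothetical toy frame, not the real dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3],                        # identifier stored as integer
    "Sex": [0, 1, 0],                       # category stored as integer code
    "City": ["A", "B", "A"],                # category stored as text
    "Income": [30000.0, 52000.0, 75000.0],  # genuinely numeric
})

# Keep every column whose dtype is numeric (integers and floats)
numeric = toy.select_dtypes("number").copy()
kept = list(numeric.columns)

print(kept)
```

In this toy frame the text column is dropped, while the identifier and the integer-coded category are kept because they are stored as numbers. If the real data contains columns like these, it may be worth deciding explicitly whether they belong in the clustering.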

Step 6: Standardize using StandardScaler(), which follows three steps: initialise, fit and transform

from sklearn.preprocessing import StandardScaler
Xcustomers = StandardScaler().fit_transform(customers_features)

customers_features.shape

(6, 8)

Xcustomers.shape

(6, 8)

type(Xcustomers)

numpy.ndarray
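The three steps that fit_transform() combines can be written out separately. The sketch below uses a hypothetical toy array (values invented for illustration) and compares the result with a manual calculation, on the understanding that StandardScaler subtracts the column mean and divides by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[20.0, 30000.0],
                [40.0, 60000.0],
                [60.0, 90000.0]])

scaler = StandardScaler()        # initialise
scaler.fit(toy)                  # fit: learn the mean and std of each column
scaled = scaler.transform(toy)   # transform: apply (x - mean) / std

# The same calculation by hand, using the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, and the shape of the array is unchanged, which matches the shape check above.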

Step 7: Visualize the variables after standardization, which puts the box plots on a comparable scale

sns.catplot(data=pd.DataFrame(data=Xcustomers, columns=customers_features.columns), kind='box',
            aspect=2)
plt.show()

Note: In the category plot above, sex and marital status are not numeric quantities. Because Step 4 kept only the columns with a numeric dtype, the clusters are identified from those columns alone.

Step 8: Introduction to K-Means

customers.head(5)

Check for missing (NA) values before defining the number of clusters

customers.isna().sum()

Step 9: Visualize the category plot after checking for NA values, then fit KMeans with 2 clusters

sns.catplot(data=pd.DataFrame(data=Xcustomers, columns=customers_features.columns), kind='box', aspect=2)
plt.show()

# Fit KMeans with 2 clusters on the standardized features
customers_cluster_2 = KMeans(n_clusters=2, random_state=210, n_init=25, max_iter=500).fit_predict(Xcustomers)
# Store the cluster labels as a categorical column k2
customers['k2'] = pd.Series(data=customers_cluster_2, index=customers.index).astype('category')
customers.info()

Step 10: Check the value counts of the k2 clusters

customers.k2.value_counts()

Step 11: Create a pairplot colored by the 2 clusters (hue = k2) and interpret the variables

sns.pairplot(data=customers, hue='k2')
plt.show()

Example 2: Use 3 clusters and create a variable k3 in the same dataset

customers_cluster_3 = KMeans(n_clusters=3, random_state=210, n_init=25, max_iter=50).fit_predict(Xcustomers)
# Use a Series (as for k2) so the labels are assigned as a single categorical column
customers['k3'] = pd.Series(data=customers_cluster_3, index=customers.index).astype('category')
customers.k3.value_counts()
sns.pairplot(data=customers, hue='k3', diag_kws={'common_norm':False})
plt.show()

Step 12: Heatmap with 3 clusters

fig, ax = plt.subplots()

sns.heatmap(data=pd.crosstab(customers.Age, customers.k3),
            annot=True, annot_kws={'fontsize':25},
            ax=ax, fmt='g', cbar=False)

plt.show()

Step 13: Finding the optimal number of clusters

# Total within-cluster sum of squares (inertia) for each candidate k
tots_within = []

K = range(1, 7)

for k in K:
    km = KMeans(n_clusters=k, random_state=210, n_init=25, max_iter=500)
    km = km.fit(Xcustomers)

    tots_within.append(km.inertia_)
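The inertia_ value collected in the loop above is the total within-cluster sum of squares: the sum of squared distances from each point to the centre of its assigned cluster. The sketch below checks this on hypothetical synthetic data (two random blobs generated only for illustration), reusing the same KMeans settings as the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs, for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

km = KMeans(n_clusters=2, random_state=210, n_init=25, max_iter=500).fit(X)

# Look up the centre assigned to each point, then sum the squared distances
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centres) ** 2).sum()

print(km.inertia_, manual_inertia)
```

The two numbers should agree up to floating-point rounding. Because adding clusters can only shorten these distances, the value falls as k grows, which is why the plot is read for the point where the decrease levels off rather than for a minimum.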

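Visualize the knee bend: plot the total within-cluster sum of squares against the number of clusters and look for the point where the curve stops dropping sharply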

fig, ax = plt.subplots()

ax.plot(K, tots_within,'bo-')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Total Within Sum of Squares')
ax.set_title('Customer Clustering based on available clusters from the dataset')
plt.show()
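The knee is read off the plot by eye here. As an optional cross-check, one common heuristic (not the method used in this walkthrough) picks the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below applies it to hypothetical values invented for illustration; they are not results from the customer dataset.

```python
import numpy as np

# Hypothetical within-cluster sums of squares for k = 1..6 (illustration only)
ks = np.arange(1, 7)
wss = np.array([1000.0, 520.0, 260.0, 220.0, 195.0, 180.0])

# Rescale both axes to [0, 1] so that neither axis dominates the distance
x = (ks - ks.min()) / (ks.max() - ks.min())
y = (wss - wss.min()) / (wss.max() - wss.min())

# Assuming the curve decreases from its first to its last point, the
# rescaled endpoints are (0, 1) and (1, 0), so the line joining them
# is x + y - 1 = 0 and the distance to it is |x + y - 1| / sqrt(2)
distance = np.abs(x + y - 1) / np.sqrt(2)

# The candidate knee is the k with the largest distance from that line
knee_k = int(ks[np.argmax(distance)])

print(knee_k)
```

To try this on the real curve, ks and wss would be replaced by list(K) and tots_within from Step 13. The result is only a suggestion and is best compared with the plot and with how interpretable the resulting clusters are.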