Customer Segmentation Analysis using Python

Customer clustering using K Means and Principal Component Analysis (PCA)

About the dataset:

A response model can provide a significant boost to the efficiency of a marketing campaign by increasing responses or reducing expenses. The objective is to predict who will respond to an offer for a product or service.

Step 1 : Importing Required Libraries and packages

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Step 2 : Loading the dataset

We will use pandas to read the data from the csv file using the read_csv function. This function returns a pandas dataframe. We will store this dataframe in a variable called df.

df = pd.read_csv('/Users/maheshg/Dropbox/Sample Datasets Kaggle/Customer Segment Analysis/marketing_campaign.csv', delimiter=';')
df.info()
df.describe()
df.head()
df.shape
df.size
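
To see how read_csv handles a semicolon-separated file without depending on a local path, here is a minimal sketch. The three-row sample and the names sample_csv and sample_df are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the CSV file.
sample_csv = io.StringIO(
    "ID;Year_Birth;Income\n"
    "1;1957;58138\n"
    "2;1954;46344\n"
    "3;1965;\n"
)

# sep=';' is equivalent to delimiter=';'.
sample_df = pd.read_csv(sample_csv, sep=";")

print(sample_df.shape)         # (rows, columns)
print(sample_df.isna().sum())  # missing values per column
```

The empty last field is read as a missing value, which is why it is worth checking isna() right after loading.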

Step 3: Let's prepare our data for analysis. Follow the steps below to review the first 5 rows of your dataset, display column names, and get other basic information about the dataset

df.head(n=5)
df.shape
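# Check for duplicates: count the rows that repeat an earlier row.
df.duplicated().sum()
# Preview friendlier column names. rename() returns a new DataFrame, so df keeps
# its original names, which the later steps rely on.
df.rename(columns={'Year_Birth':'Birth Year','Marital_Status':'Marital'})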
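
The behaviour of duplicated() and rename() is easier to see on a tiny frame. The sketch below uses a made-up four-row DataFrame called toy; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Year_Birth": [1957, 1954, 1954, 1965],
})

# duplicated() flags rows that repeat an earlier row; sum() counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

# rename() returns a new DataFrame and leaves the original untouched.
renamed = deduped.rename(columns={"Year_Birth": "Birth Year"})

print(n_duplicates, list(deduped.columns), list(renamed.columns))
```

Because rename() does not modify the frame in place, the result has to be assigned to a variable if the new names should be kept.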

Step 4 : The goal of this step is to perform feature engineering as required and drop the features that are irrelevant.

# Create a new column named 'Age'.
df['Age'] = 2022 - df['Year_Birth']

# Create a new column for all the accepted campaigns.
df['Accepted_Campaigns'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']

# Create a new column for the total amount spent across all product categories.
df['Total_Items'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

# Create a new column for all the purchases.
df['Total_Purchases'] = df['NumDealsPurchases'] + df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases']

# Display the dataframe with the updated columns.
df.head()
# Drop irrelevant features.
df_new = df.drop(['Dt_Customer', 'Education', 'Marital_Status', 'Year_Birth', 'ID'], axis = 1)

# Display the dataframe.
df_new.head()
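
As an alternative to adding columns one by one, related columns can be selected by prefix and summed row-wise. This is a sketch on a made-up frame called toy, not the approach used above; note that sum(axis=1) skips missing values by default, whereas the + operator propagates them.

```python
import pandas as pd

# Hypothetical data with three campaign flags and one unrelated column.
toy = pd.DataFrame({
    "AcceptedCmp1": [0, 1, 0],
    "AcceptedCmp2": [0, 1, 0],
    "AcceptedCmp3": [1, 0, 0],
    "Response": [1, 0, 0],
})

# Collect every column whose name starts with the shared prefix.
campaign_cols = [c for c in toy.columns if c.startswith("AcceptedCmp")]

# Row-wise total of the selected columns.
toy["Accepted_Campaigns"] = toy[campaign_cols].sum(axis=1)

print(toy)
```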

Step 5 : Data Visualisation using the seaborn library

# Plot distributions for the relevant columns and check for outliers.
# Boxplot for `Income` distribution.
plt.figure(figsize = (8,5))
sns.boxplot(df, x = 'Income',  color = 'skyblue')
plt.title('Income Distribution');
# Calculate the Interquartile range (IQR) for the `Income` column.
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1

#  Identify the outliers in the Income column
outliers = df[(df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR))]

# Print the number of outliers
print("Outliers in the Income column:", len(outliers))

Outliers in the Income column: 8

# Remove the outliers in the `Income` column.
df = df[~((df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR)))]
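
The 1.5 * IQR rule used above can be checked by hand on a small series. The helper iqr_bounds and the sample values below are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule (illustrative helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

lower, upper = iqr_bounds(values)

# between() is inclusive on both ends, so values on a fence are kept.
kept = values[values.between(lower, upper)]

print(lower, upper)   # Q1 = 12.75, Q3 = 16.5, IQR = 3.75
print(kept.tolist())
```

Here the fences are 7.125 and 22.125, so only the value 100 falls outside them.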
# Plot Histograms for the important columns.
fig, axes = plt.subplots(nrows = 3, ncols = 2, figsize = (10,10))

# Histogram for `Income` distribution.
sns.histplot(df, x = 'Income', color = 'skyblue', bins = 50, ax = axes[0,0])
axes[0,0].set_title('Income Distribution')

#  Histogram for `Age` distribution.
sns.histplot(df, x = 'Age', color = 'orange', bins = 50, ax = axes[0,1])
axes[0,1].set_title('Age Distribution')

#  Histogram for `Kidhome` distribution.
sns.histplot(df, x = 'Kidhome', color = 'green', ax = axes[1,0])
axes[1,0].set_title('Number of children by household')

#  Histogram for `Teenhome` distribution.
sns.histplot(df, x = 'Teenhome', color = 'purple', ax = axes[1,1])
axes[1,1].set_title('Teenhome Distribution')
 
#  Histogram for `Education` distribution.
sns.histplot(df, x = 'Education', color ='red', ax = axes[2,0])
axes[2,0].set_title('Education Distribution')

#  Histogram for `Marital_Status` distribution.
sns.histplot(df, x = 'Marital_Status', color = 'brown', ax = axes[2,1])
axes[2,1].set_title('Marital Status Distribution')
plt.xticks(rotation = 45)

plt.tight_layout();
# Distributions for the `Accepted_Campaigns`, `Total_Items`, and `Total_Purchases` columns.
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (14,4))

sns.histplot(df, x = 'Accepted_Campaigns', color = 'skyblue', ax = axes[0])
axes[0].set_title('Accepted Campaign Distribution')

sns.histplot(df, x = 'Total_Items', color = 'pink', ax = axes[1])
axes[1].set_title('Total Item Distribution')

sns.histplot(df, x = 'Total_Purchases', color = 'green', ax = axes[2])
axes[2].set_title('Total Purchases Distribution')
plt.tight_layout();
fig, axes = plt.subplots(1, 2, figsize = (10,5))

# Bar plot for the average `Total_Purchases` by `Education`.
df1 = df.groupby(['Education'])['Total_Purchases'].mean().reset_index()
sns.barplot(df1, x = 'Education', y = 'Total_Purchases', ax = axes[0])
axes[0].set_title('Average Total Purchases by Education')

# Bar plot for the average `Total_Purchases` by `Marital_Status`.
df2 = df.groupby(['Marital_Status'])['Total_Purchases'].mean().reset_index()
sns.barplot(df2, x = 'Marital_Status', y = 'Total_Purchases', ax = axes[1])
axes[1].set_title('Average Total Purchases by Marital Status')
plt.xticks(rotation = 45)

plt.tight_layout();

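Step 6 : Now that we have prepared the data and performed exploratory data analysis (EDA), we prepare the features for clustering. The code in this step keeps only the numeric columns, so categorical columns such as Education and Marital_Status are dropped rather than one-hot encoded. It then fills missing values with the column mean and scales the data.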

# Keep only the numeric columns; categorical columns are dropped here.
df = df.select_dtypes(include=['number'])
# Fill missing values with the column mean.
df = df.fillna(df.mean())

# Perform data scaling using StandardScaler (imported in Step 1).
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
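
If the categorical columns should be kept as features, one common option is to one-hot encode them with pd.get_dummies before scaling. The sketch below assumes a made-up frame called toy with a single categorical column; it is an illustration, not the preprocessing applied in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one numeric and one categorical column.
toy = pd.DataFrame({
    "Income": [30000.0, 50000.0, 70000.0, 90000.0],
    "Education": ["Graduation", "PhD", "Master", "PhD"],
})

# One-hot encode the categorical column; each category becomes a 0/1 column.
encoded = pd.get_dummies(toy, columns=["Education"], dtype=float)

# Standardise every column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)

print(encoded.columns.tolist())
print(scaled.round(2))
```

Whether to scale the 0/1 indicator columns along with the numeric ones is a design choice; scaling them keeps all features on a comparable footing for distance-based methods such as K-Means.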

Step 7 : Initialize Principal Component Analysis

# Initialize and fit the PCA model.
pca = PCA(n_components = 3)
pca.fit(df)
PCA_df = pd.DataFrame(pca.transform(df), columns=(["Group_1","Group_2", "Group_3"]))
PCA_df.describe().T
x = PCA_df["Group_1"]
y = PCA_df["Group_2"]
z = PCA_df["Group_3"]

fig = plt.figure(figsize = (10,8))
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(x, y, z, c = "hotpink")
ax.set_title("3D Projection Of Data after performing PCA")
plt.show()
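
A quick way to judge how much information three components retain is the explained_variance_ratio_ attribute of a fitted PCA model. The sketch below runs on synthetic data generated for illustration, so the numbers it prints say nothing about the marketing dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch): 200 rows, 6 columns
# driven by two hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

pca_demo = PCA(n_components=3).fit(X)

# Share of total variance captured by each component, largest first.
ratios = pca_demo.explained_variance_ratio_
# Running total: variance retained by the first 1, 2, and 3 components.
cumulative = np.cumsum(ratios)

print(ratios.round(3))
print(cumulative.round(3))
```

Applied to the scaled dataframe from Step 6, the same two lines would show how much variance the three components Group_1, Group_2, and Group_3 keep.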
#  Use Elbow method to determine the best number of clusters.
wcss = []

for k in range(1, 15):
    kmeans = KMeans(n_clusters = k, n_init = 10, random_state = 42)
    kmeans.fit(PCA_df)
    wcss.append(kmeans.inertia_)

plt.figure()
plt.plot(range(1,15), wcss)
plt.xticks(range(1,15))
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS (Within Cluster Sum of Squares)")
plt.show()
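
The elbow in a WCSS curve can be hard to read by eye. A complementary check, not used in the notebook above, is the silhouette score, where higher values indicate better-separated clusters. The sketch below uses three synthetic blobs generated for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for this sketch): three well-separated blobs in 3D.
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# The silhouette score is only defined for two or more clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest average silhouette.
best_k = max(scores, key=scores.get)

print(scores)
print("Best k by silhouette:", best_k)
```

Running the same loop on PCA_df gives a second opinion on the cluster count suggested by the elbow plot.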