Customer Segmentation Analysis using Python
Customer clustering using K Means and Principal Component Analysis (PCA)

About the dataset:
A response model can significantly improve the efficiency of a marketing campaign by increasing responses or reducing expenses. The objective is to predict who will respond to an offer for a product or service.
Step 1 : Importing Required Libraries and packages
#TODO: Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
Step 2 : We will use pandas to read the data from the CSV file using the read_csv function. This function returns a pandas DataFrame, which we will store in a variable called df.
df = pd.read_csv('/Users/maheshg/Dropbox/Sample Datasets Kaggle/Customer Segment Analysis/marketing_campaign.csv', delimiter=';')
df.info()
df.describe()
df.head()
df.shape
df.size
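To try the loading step without the original file, here is a minimal sketch that reads a small semicolon-delimited sample from an in-memory string. The rows and values are made up for illustration; only the column names come from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the ';'-delimited layout of the file.
sample_csv = "ID;Year_Birth;Income\n1;1980;52000\n2;1975;\n3;1990;61000\n"

sample_df = pd.read_csv(io.StringIO(sample_csv), delimiter=';')

sample_df.info()             # column dtypes and non-null counts
print(sample_df.describe())  # describe() is a method, so it needs parentheses
print(sample_df.shape)       # (rows, columns)
print(sample_df.size)        # rows * columns
```

The empty Income field in the second row is read as a missing value (NaN), which is why info() reports fewer non-null entries for that column.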
Step 3: Let's prepare our data for analysis. Follow the steps below to review the first 5 rows of your dataset, display column names, and get other basic information about the dataset.
df.head(n=5)
df.shape
df.columns
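# TODO: Check for duplicates (count the fully duplicated rows).
df.duplicated().sum()
# Preview friendlier column names. rename() returns a new DataFrame, so df keeps its original column names, which the later steps rely on.
df.rename(columns={'Year_Birth':'Birth Year','Marital_Status':'Marital'})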
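The duplicate check and the rename can be tried on a toy frame first. This is a minimal sketch with made-up rows; `toy`, `deduped`, and `renamed` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy = pd.DataFrame({
    'ID': [1, 2, 2],
    'Year_Birth': [1980, 1975, 1975],
    'Marital_Status': ['Single', 'Married', 'Married'],
})

n_duplicates = toy.duplicated().sum()   # number of repeated rows
deduped = toy.drop_duplicates()         # returns a new frame without them

# rename() also returns a new frame; `toy` keeps its original column names.
renamed = toy.rename(columns={'Year_Birth': 'Birth Year', 'Marital_Status': 'Marital'})

print(n_duplicates)
print(list(renamed.columns))
print(list(toy.columns))
```

Because neither call modifies the frame in place, the result has to be assigned to a variable if it should be kept.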
Step 4 : The goal of this step is to perform feature engineering as required and drop the features that are irrelevant.
# Create a new column named 'Age'.
df['Age'] = 2022 - df['Year_Birth']
# Create a new column for all the accepted campaigns.
df['Accepted_Campaigns'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
# Create a new column for the total amount spent across all product categories (the Mnt* columns).
df['Total_Items'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
# Create a new column for all the purchases.
df['Total_Purchases'] = df['NumDealsPurchases'] + df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases']
# Display the dataframe with the updated columns.
df.head()
# Drop irrelevant features.
df_new = df.drop(['Dt_Customer', 'Education', 'Marital_Status', 'Year_Birth', 'ID'], axis = 1)
# Display the dataframe.
df_new.head()
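As an alternative to adding the columns one by one, the same totals can be computed as a row-wise sum over a list of column names. This is a minimal sketch on made-up rows; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    'Year_Birth': [1980, 1975],
    'AcceptedCmp1': [0, 1], 'AcceptedCmp2': [0, 0], 'AcceptedCmp3': [1, 1],
    'AcceptedCmp4': [0, 0], 'AcceptedCmp5': [0, 1],
})

campaign_cols = [f'AcceptedCmp{i}' for i in range(1, 6)]

toy['Age'] = 2022 - toy['Year_Birth']
# Row-wise sum over the listed columns.
toy['Accepted_Campaigns'] = toy[campaign_cols].sum(axis=1)
print(toy[['Age', 'Accepted_Campaigns']])
```

One difference to keep in mind: sum(axis=1) skips missing values by default, whereas adding columns with + gives NaN for any row that contains a missing value.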
Step 5 : Data Visualisation using seaborn library
# Plot distributions for the relevant columns and check for outliers.
# Boxplot for `Income` distribution.
plt.figure(figsize = (8,5))
sns.boxplot(df, x = 'Income', color = 'skyblue')
plt.title('Income Distribution');
# Calculate the Interquartile range (IQR) for the `Income` column.
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
# Identify the outliers in the Income column
outliers = df[(df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR))]
# Print the number of outliers
print("Outliers in the Income column:", len(outliers))
# Output: Outliers in the Income column: 8
# Remove the outliers in the `Income` column.
df = df[~((df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR)))]
# Plot histograms for the important columns.
fig, axes = plt.subplots(nrows = 3, ncols = 2, figsize = (10,10))
# Histogram for `Income` distribution.
sns.histplot(df, x = 'Income', color = 'skyblue', bins = 50, ax = axes[0,0])
axes[0,0].set_title('Income Distribution')
# Histogram for `Age` distribution.
sns.histplot(df, x = 'Age', color = 'orange', bins = 50, ax = axes[0,1])
axes[0,1].set_title('Age Distribution')
# Histogram for `Kidhome` distribution.
sns.histplot(df, x = 'Kidhome', color = 'green', ax = axes[1,0])
axes[1,0].set_title('Number of children by household')
# Histogram for `Teenhome` distribution.
sns.histplot(df, x = 'Teenhome', color = 'purple', ax = axes[1,1])
axes[1,1].set_title('Teenhome Distribution')
# Histogram for `Education` distribution.
sns.histplot(df, x = 'Education', color ='red', ax = axes[2,0])
axes[2,0].set_title('Education Distribution')
# Histogram for `Marital_Status` distribution.
sns.histplot(df, x = 'Marital_Status', color = 'brown', ax = axes[2,1])
axes[2,1].set_title('Marital Status Distribution')
plt.xticks(rotation = 45)
plt.tight_layout();
# Distributions for the `Accepted_Campaigns`, `Total_Items`, and `Total_Purchases` columns.
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (14,4))
sns.histplot(df, x = 'Accepted_Campaigns', color = 'skyblue', ax = axes[0])
axes[0].set_title('Accepted Campaign Distribution')
sns.histplot(df, x = 'Total_Items', color = 'pink', ax = axes[1])
axes[1].set_title('Total Item Distribution')
sns.histplot(df, x = 'Total_Purchases', color = 'green', ax = axes[2])
axes[2].set_title('Total Purchases Distribution')
plt.tight_layout();
fig, axes = plt.subplots(1, 2, figsize = (10,5))
# TODO: Bar plot for `Total_Purchases` by `Education`.
df1 = df.groupby(['Education'])['Total_Purchases'].mean().reset_index()
sns.barplot(df1, x = 'Education', y = 'Total_Purchases', ax = axes[0])
axes[0].set_title('Average Total Purchases by Education')
# TODO: Bar plot for `Total_Purchases` by `Marital_Status`.
df2 = df.groupby(['Marital_Status'])['Total_Purchases'].mean().reset_index()
sns.barplot(df2, x = 'Marital_Status', y = 'Total_Purchases', ax = axes[1])
axes[1].set_title('Average Total Purchases by Marital Status')
plt.xticks(rotation = 45)
plt.tight_layout();
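The IQR rule applied to `Income` in this step can be wrapped in a small reusable function. The sketch below is one way to do it; `iqr_bounds` is a hypothetical helper and the income values are made up.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes with one extreme value.
income = pd.Series([30000, 40000, 50000, 60000, 70000, 600000])

lower, upper = iqr_bounds(income)
is_outlier = (income < lower) | (income > upper)
filtered = income[~is_outlier]

print(lower, upper)
print(is_outlier.sum())
```

Rows with a missing value are never flagged by this mask, because comparisons with NaN evaluate to False, so they remain in the data until they are imputed or dropped.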
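Step 6 : Now that we have prepared the data and performed exploratory data analysis (EDA), we prepare the features for modelling. The code in this step keeps only the numeric columns (the categorical columns are dropped rather than one-hot encoded), fills missing values with the column mean, and then scales the data.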
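As a minimal sketch of what one-hot encoding followed by scaling could look like, the example below uses pd.get_dummies and StandardScaler on made-up rows. Using get_dummies here is an assumption for illustration and is not part of this post's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; the column names follow the dataset.
toy = pd.DataFrame({
    'Income': [30000.0, 50000.0, 70000.0, 90000.0],
    'Education': ['Graduation', 'PhD', 'Master', 'PhD'],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=['Education'], dtype=float)

# Standardise every column to zero mean and unit variance.
scaled = pd.DataFrame(StandardScaler().fit_transform(encoded), columns=encoded.columns)
print(scaled.round(2))
```

Scaling matters for the later steps because both PCA and K-Means are sensitive to the units of each feature.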
# Keep only the numeric columns.
df = df.select_dtypes(include=['number'])
# Fill missing values with the column mean.
df = df.fillna(df.mean())
# Perform data scaling using StandardScaler.
scaler = StandardScaler()
scaler.fit(df)
df = pd.DataFrame(scaler.transform(df), columns=df.columns)
Step 7 : Initialize Principal Component Analysis
# Initialize and fit the PCA model.
pca = PCA(n_components = 3)
pca.fit(df)
PCA_df = pd.DataFrame(pca.transform(df), columns=(["Group_1","Group_2", "Group_3"]))
PCA_df.describe().T
x = PCA_df["Group_1"]
y = PCA_df["Group_2"]
z = PCA_df["Group_3"]
fig = plt.figure(figsize = (10,8))
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(x, y, z, c = "hotpink")
ax.set_title("3D Projection Of Data after performing PCA")
plt.show()
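A quick way to judge how much information three components retain is to look at the explained variance ratio of the fitted PCA model. The sketch below uses synthetic data as a stand-in for the scaled dataset, so the numbers it prints say nothing about the customer data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled data: 200 rows, 6 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two features correlated

pca_demo = PCA(n_components=3)
components = pca_demo.fit_transform(X)

# Share of the total variance captured by each retained component.
print(pca_demo.explained_variance_ratio_)
# Share captured by the three components together.
print(pca_demo.explained_variance_ratio_.sum())
```

If the combined share is low, the 3D projection discards much of the structure in the data, and more components may be worth keeping.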
# Use Elbow method to determine the best number of clusters.
wcss = []
for k in range(1, 15):
    kmeans = KMeans(n_clusters = k, random_state = 42)
    kmeans.fit(PCA_df)
    wcss.append(kmeans.inertia_)
plt.figure()
plt.plot(range(1,15), wcss)
plt.xticks(range(1,15))
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS (Within Cluster Sum of Squares)")
plt.show()
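Once a value of k has been read off the elbow plot, the final model can be fitted and each row given a cluster label. The sketch below shows this on synthetic data shaped like PCA_df; the value k = 4 and the names `points` and `final_kmeans` are assumptions for illustration, not results from the customer data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-column stand-in for PCA_df, with 4 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)
feature_cols = ["Group_1", "Group_2", "Group_3"]
points = pd.DataFrame(X, columns=feature_cols)

k = 4  # hypothetical choice read off an elbow plot
final_kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
points["Cluster"] = final_kmeans.fit_predict(points[feature_cols])

# Number of points assigned to each cluster.
print(points["Cluster"].value_counts().sort_index())
```

The cluster labels can then be joined back to the original customer columns to describe each segment.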