Social Media and Mental Health Awareness: Classification Use Case

Social Media Dataset using Classification, KNN algorithm, Ensemble Learning

About the dataset:

This dataset was collected via a survey to investigate the relationship between social media usage habits and mental well-being. The study focuses on understanding how platform preference, daily usage time, and interaction patterns (such as checking notifications or engaging in arguments) correlate with user stress levels, sleep quality, and academic performance.

This data is particularly useful for exploring the psychological impact of digital habits, identifying "high-risk" user segments, and building predictive models for mental health trends related to technology.

Context:

The primary goal is to discover which social media habits are linked to high stress or low mood, and to use those findings to promote healthier online behavior.

Step 1: Importing the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Step 2: Reading the dataset using Pandas:

social_media_df = pd.read_csv('/Users/maheshg/Dropbox/Sample Datasets Kaggle/Social Media and Mental Health.csv')
social_media_df.head()
social_media_df.columns
Index(['Timestamp', 'Gender ?', 'Your City',
       'On average, how many hours per day do you spend on social media?',
       'What is your age?', 'Current Occuption',
       'Which social media platform do you use the most?',
       'What type of content do you consume most? (Select all that apply)',
       'How many hours of sleep do you get on average per night?',
       'Do you use social media right before sleeping?  ',
       'Do you check social media immediately after waking up?  ',
       'When you receive a notification while studying, what is your immediate reaction?',
       '  What is the specific thing about social media that causes you the most stress or anxiety?  ',
       'How often do you find yourself comparing your life to others?',
       'Do you feel "FOMO" (Fear of Missing Out) when you are offline?',
       'In a few words, describe how you feel when you see your friends having fun without you on social media.',
       'With how many specific people do you interact (DM/Tag) on a daily basis?',
       'How would you rate your current daily stress level?',
       'What is your current CGPA range? ',
       '  Why do you usually open social media when you are supposed to be studying?  ',
       '  Do you use social media to escape or forget about your real-life problems?  ',
       '  How often do you get into arguments or heated discussions in comment sections?  ',
       '  How does it affect your mood if a post you made gets fewer likes than you expected?  ',
       'Have you ever experienced any negative events on social media? (cyberbullying, harassment, hate comments, etc.)  ',
       'Do you have any medically approved mental disorder?'],
      dtype='object')
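The column names above carry leading/trailing whitespace (e.g. 'What is your current CGPA range? ', '  Do you use social media to escape or forget about your real-life problems?  '), which makes column access error-prone. A minimal cleanup sketch; the small DataFrame below is a stand-in for illustration, and in the notebook the same `str.strip()` call would be applied to `social_media_df.columns`:

```python
import pandas as pd

# Stand-in frame reproducing the whitespace issues seen in the survey columns
df = pd.DataFrame({
    'Gender ?': ['M', 'F'],
    'What is your current CGPA range? ': ['3-3.5', '2.5-3'],
    '  Do you use social media to escape or forget about your real-life problems?  ': ['Yes', 'No'],
})

# Strip leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(df.columns.tolist())
```

Doing this once, right after `read_csv`, means every later cell can reference the trimmed names without guessing at stray spaces.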

# Encode categorical columns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans

label_encoder = {}
for col in social_media_cluster.columns:
    if social_media_cluster[col].dtype == 'object':
        le = LabelEncoder()
        social_media_cluster[col] = le.fit_transform(social_media_cluster[col].astype(str))
        label_encoder[col] = le
### Scale the data : 
scaler = StandardScaler()
scaled_data = scaler.fit_transform(social_media_cluster)
### Run K means clustering :
kmeans = KMeans(n_clusters=3, random_state=42)
social_media_cluster['cluster'] = kmeans.fit_predict(scaled_data)
cluster_data = social_media_df_copy[cluster_cols].copy()
cluster_data['cluster'] = social_media_cluster['cluster']
# Elbow (k-bend) method, automatic elbow detection, and final clustering summary
# Uses existing variables: scaled_data, scaler, social_media_df_copy, cluster_cols,
# label_encoder, KMeans, plt, np, pd

# 1. Compute inertia for each k in the range
k_range = range(1, 11)
inertias = []
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(scaled_data)
    inertias.append(km.inertia_)

# Plot inertia vs k (elbow plot)
plt.figure(figsize=(8, 4))
plt.plot(list(k_range), inertias, '-o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow method (Inertia vs k)')
plt.xticks(list(k_range))
plt.grid(True)
plt.show()

# 2. Automatic elbow detection using the max-distance-to-line method:
# treat each (k, inertia) pair as a 2D point, draw a line from the first point
# to the last, and pick the k whose point lies farthest from that line.
pts = np.column_stack((np.array(list(k_range)), np.array(inertias)))
p1 = pts[0]
p2 = pts[-1]
v = p2 - p1  # vector from p1 to p2
# distance = |cross(v, p_i - p1)| / |v|; in 2D the cross product reduces to the area formula
norm_v = np.linalg.norm(v)
if norm_v == 0:
    elbow_k = 1
else:
    distances = np.abs(np.cross(v, pts - p1)) / norm_v
    elbow_idx = np.argmax(distances)
    elbow_k = int(k_range[elbow_idx])
print(f"Detected elbow (k) = {elbow_k}")

# 3. Fit final KMeans with the detected k and attach labels to the original data
kmeans_opt = KMeans(n_clusters=elbow_k, random_state=42)
labels_opt = kmeans_opt.fit_predict(scaled_data)
final_cluster_data = social_media_df_copy[cluster_cols].copy()
final_cluster_data['cluster'] = labels_opt

# Basic cluster counts and numeric summary
print("\nCluster counts:")
print(final_cluster_data['cluster'].value_counts().sort_index())
print("\nCluster-wise mean hours on social media:")
print(final_cluster_data.groupby('cluster')['On average, how many hours per day do you spend on social media?'].mean())

# 4. Decode cluster centers to human-readable values
# (inverse transform for the label-encoded categorical features)
centers_scaled = kmeans_opt.cluster_centers_
centers_unscaled = scaler.inverse_transform(centers_scaled)  # back to encoded/original space
centers_df = pd.DataFrame(centers_unscaled, columns=cluster_cols)

def decode_center_row(row):
    decoded = {}
    for col in centers_df.columns:
        val = row[col]
        # if this column was label-encoded, inverse transform the nearest integer label
        if col in label_encoder and hasattr(label_encoder[col], 'inverse_transform'):
            # round to the nearest integer and clip to the valid label range
            label_vals = np.arange(len(label_encoder[col].classes_))
            int_val = int(np.clip(np.round(val), label_vals.min(), label_vals.max()))
            decoded[col] = label_encoder[col].inverse_transform([int_val])[0]
        else:
            # numeric column (hours)
            decoded[col] = float(round(val, 2))
    return decoded

decoded_centers = pd.DataFrame([decode_center_row(centers_df.iloc[i]) for i in range(centers_df.shape[0])])
decoded_centers.index.name = 'cluster'
print("\nDecoded cluster centers (approximate):")
print(decoded_centers)

# 5. Cluster-wise mode for categorical columns and numeric summary
cat_cols = [c for c in cluster_cols if final_cluster_data[c].dtype == 'object']
num_cols = [c for c in cluster_cols if c not in cat_cols]
mode_summary = final_cluster_data.groupby('cluster')[cat_cols].agg(
    lambda x: x.mode().iat[0] if not x.mode().empty else np.nan)
num_summary = final_cluster_data.groupby('cluster')[num_cols].agg(['mean', 'count'])
print("\nCluster categorical modes:")
print(mode_summary)
print("\nCluster numeric summary:")
print(num_summary)

# (Optional) Save the final clustering result for downstream analysis
final_cluster_data.reset_index(drop=True, inplace=True)
final_cluster_data.to_csv('social_media_final_clusters.csv', index=False)
print("\nFinal clustered data saved to 'social_media_final_clusters.csv'")
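The title mentions KNN and ensemble learning, but the cells above only cover clustering. A hedged sketch of how the encoded features could feed those classifiers, assuming a binary high-stress target derived from the stress-level column; the synthetic `X`/`y` below are stand-ins for `scaled_data` and that thresholded target, not the real survey data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Stand-in for the scaled, label-encoded survey features and a binary
# high-stress label; in the notebook these would come from scaled_data
# and a thresholded 'How would you rate your current daily stress level?' column.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Compare a distance-based model (KNN) with a tree ensemble (random forest)
for name, model in [('KNN', KNeighborsClassifier(n_neighbors=5)),
                    ('RandomForest', RandomForestClassifier(n_estimators=100, random_state=42))]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy: {acc:.2f}")
```

Because KNN is distance-based, it benefits from the same `StandardScaler` step used before clustering; the random forest is scale-invariant but gives feature importances for free.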

Conclusion: 

The elbow plot suggests using 3 clusters (k = 3): increasing k beyond this point yields diminishing reductions in inertia, so additional clusters likely add little explanatory value. Proceed with k = 3 and validate the clusters by examining cluster sizes, decoded centers, and domain relevance.
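One complementary way to validate the chosen k, beyond eyeballing the elbow, is the silhouette score. A hedged sketch on synthetic blob data; in the notebook, `scaled_data` would replace the `make_blobs` stand-in `X`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Compare silhouette scores across candidate k values
# (higher is better; the score lives in [-1, 1])
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: silhouette={score:.3f}")
```

If the silhouette peak agrees with the elbow's k, that is reassuring; if not, the decoded cluster centers and cluster sizes should be the tie-breaker.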