Demographic Descriptive Statistics using Python
Demographic descriptive statistics are summaries that describe the characteristics of a population or sample, such as age, gender, income, education, and ethnicity.
Common Demographic Variables
Age: Usually presented as mean, median, mode, or grouped into age ranges (e.g., 18–24, 25–34).
Gender: Shown as counts or percentages (e.g., 55% female, 45% male).
Ethnicity or Race: Reported by category and frequency (e.g., 40% White, 30% Black, 20% Asian, etc.).
Education Level: Percentages of participants with high school, undergraduate, graduate degrees, etc.
Income: Mean, median, and ranges or income brackets.
Employment Status: Percentages of people who are employed, unemployed, retired, or students.
Geographic Location: Grouped by region, country, city, or rural vs. urban area.
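As a minimal sketch of how two of these summaries might be produced with pandas, the example below groups ages into ranges and reports gender as percentages. The `respondents` frame, its column names, and all of its values are hypothetical and are not part of the dataset used later in this post.

```python
import pandas as pd

# Hypothetical sample; every value here is invented for illustration.
respondents = pd.DataFrame({
    "age": [19, 23, 31, 34, 45, 52, 28, 24],
    "gender": ["F", "M", "F", "F", "M", "F", "M", "F"],
})

# Group ages into ranges. By default each bin includes its upper edge,
# so (17, 24] covers ages 18 through 24.
age_bins = [17, 24, 34, 44, 54]
age_labels = ["18-24", "25-34", "35-44", "45-54"]
respondents["age_group"] = pd.cut(
    respondents["age"], bins=age_bins, labels=age_labels
)

# Counts per age range, kept in bin order instead of sorted by frequency.
age_counts = respondents["age_group"].value_counts(sort=False)

# Share of each gender as a percentage of the sample.
gender_pct = respondents["gender"].value_counts(normalize=True) * 100

print(age_counts)
print(gender_pct.round(1))
```

The bin edges are a design choice: changing them changes the counts, so the ranges should be reported alongside the results.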
Common Statistical Measures
Frequencies: How often each category appears.
Percentages: Proportion of the sample in each category.
Mean: Average value (e.g., average age).
Median: Middle value when data is ranked.
Standard Deviation: Variation around the mean.
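The measures listed above can be sketched on a tiny sample. The `ages` series below is hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical ages, invented for illustration.
ages = pd.Series([22, 25, 25, 30, 38, 40], name="age")

frequencies = ages.value_counts()                       # how often each value appears
percentages = ages.value_counts(normalize=True) * 100   # the same, as a share of the sample
mean_age = ages.mean()        # arithmetic average
median_age = ages.median()    # middle value of the sorted data
std_age = ages.std()          # sample standard deviation (pandas uses ddof=1)

print(f"mean={mean_age}, median={median_age}, std={std_age:.2f}")
```

Note that pandas computes the sample standard deviation (dividing by n - 1) by default, while NumPy's `np.std` divides by n unless `ddof=1` is passed, so the two can disagree on small samples.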
Step 1: Import the Required Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")
Step 2: Exploratory Data Analysis
population_data = pd.read_csv('/Users/Sample Datasets Kaggle/city_population.csv')
population_data.shape
(64, 10)
population_data.columns
Index(['Name', 'Abbr.', 'Division', 'Established', 'Native', 'Area (km2)',
'Population_1991', 'Population_2001', 'Population_2011',
'Population_2022'],
dtype='object')
population_data.head(n=10)
Name Abbr. Division Established Native Area (km2) Population_1991 Population_2001 Population_2011 Population_2022
0 Barguna BRG Barisal 1984 বরগুনা জেলা 1831 805000 887376 927889 1035596
1 Barishal BRS Barisal 1797 বরিশাল জেলা 2785 2299000 2465249 2414729 2634203
2 Bhola BHO Barisal 1984 ভোলা জেলা 3403 1532000 1781043 1846351 1980452
3 Jhalokati JHA Barisal 1984 ঝালকাঠি জেলা 707 694000 726182 709914 677559
4 Patuakhali PAT Barisal 1969 পটুয়াখালী জেলা 3221 1323000 1527628 1596223 1770096
5 Pirojpur PIR Barisal 1984 পিরোজপুর জেলা 1278 1104000 1161548 1157215 1227915
6 Bandarban BAN Chattogram 1981 বান্দরবান জেলা 4496 246000 311741 404091
population_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 64 non-null object
1 Abbr. 64 non-null object
2 Division 64 non-null object
3 Established 64 non-null int64
4 Native 64 non-null object
5 Area (km2) 64 non-null int64
6 Population_1991 64 non-null int64
7 Population_2001 64 non-null int64
8 Population_2011 64 non-null int64
9 Population_2022 64 non-null int64
dtypes: int64(6), object(4)
memory usage: 5.1+ KB
Step 3: Checking Missing Values
population_data.isna().sum()
Name 0
Abbr. 0
Division 0
Established 0
Native 0
Area (km2) 0
Population_1991 0
Population_2001 0
Population_2011 0
Population_2022 0
dtype: int64
# isnull() is an alias of isna(), so this returns the same counts as above.
population_data.isnull().sum()
Name 0
Abbr. 0
Division 0
Established 0
Native 0
Area (km2) 0
Population_1991 0
Population_2001 0
Population_2011 0
Population_2022 0
dtype: int64
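This dataset has no gaps, so the check above returns zeros everywhere. For data that does have gaps, it can help to report the share of missing values as well as the count. A minimal sketch, using a hypothetical `sample` frame with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the values are invented for illustration.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Population_2011": [1000.0, np.nan, 3000.0, 4000.0],
    "Population_2022": [1100.0, 2100.0, np.nan, np.nan],
})

missing_count = sample.isna().sum()
# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_pct = sample.isna().mean() * 100

missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(missing_summary)
```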
population_data.describe()
Established Area (km2) Population_1991 Population_2001 Population_2011 Population_2022
count 64.000000 64.000000 6.400000e+01 6.400000e+01 6.400000e+01 6.400000e+01
mean 1937.703125 2306.000000 1.741500e+06 2.039416e+06 2.340193e+06 2.653577e+06
std 84.372884 1184.492899 1.086475e+06 1.415458e+06 1.811774e+06 2.203057e+06
min 1666.000000 684.000000 2.460000e+05 3.117410e+05 4.040910e+05 4.952520e+05
25% 1963.500000 1379.250000 1.102250e+06 1.213935e+06 1.283516e+06 1.430418e+06
50% 1984.000000 2084.000000 1.545500e+06 1.829092e+06 2.008954e+06 2.215752e+06
75% 1984.000000 2960.750000 2.160750e+06 2.474170e+06 2.745524e+06 3.020196e+06
max 1984.000000 6116.000000 6.164000e+06 9.151343e+06 1.251736e+07 1.521085e+07
population_data.describe(include='object')
Name Abbr. Division Native
count 64 64 64 64
unique 64 64 8 64
top Barguna BRG Dhaka বরগুনা জেলা
freq 1 1 13 1
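The summary above shows that `Division` is the only categorical column with repeated values (8 unique divisions across 64 districts), which makes it a natural grouping variable. The sketch below shows one way to get frequencies, percentages, and per-group population summaries. The `toy` frame, its division names, and its numbers are hypothetical stand-ins so the block runs without the CSV.

```python
import pandas as pd

# Hypothetical rows that mimic the columns above; all values are invented.
toy = pd.DataFrame({
    "Name": ["D1", "D2", "D3", "D4", "D5"],
    "Division": ["North", "North", "South", "South", "South"],
    "Population_2022": [1_000_000, 3_000_000, 500_000, 1_500_000, 2_500_000],
})

# Number of districts per division, and each division's share of all districts.
district_counts = toy["Division"].value_counts()
district_share = toy["Division"].value_counts(normalize=True) * 100

# Population summaries per division.
by_division = toy.groupby("Division")["Population_2022"].agg(["sum", "mean", "median"])

print(district_counts)
print(district_share)
print(by_division)
```

With the real file loaded, replacing `toy` with `population_data` should give the same kind of table for the eight divisions.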
Step 4: Compute the Measures of Central Tendency
mean_population = population_data['Population_1991'].mean()
print(f"Mean Population in 1991: {mean_population}")
Mean Population in 1991: 1741500.0
mean_population = population_data['Population_2001'].mean()
print(f"Mean Population in 2001: {mean_population}")
Mean Population in 2001: 2039415.59375
mean_population = population_data['Population_2011'].mean()
print(f"Mean Population in 2011: {mean_population}")
Mean Population in 2011: 2340193.0
mean_population = population_data['Population_2022'].mean()
print(f"Mean Population in 2022: {mean_population}")
Mean Population in 2022: 2653576.765625
median_population = population_data['Population_1991'].median()
print(f"Median Population in 1991: {median_population}")
Median Population in 1991: 1545500.0
median_population = population_data['Population_2001'].median()
print(f"Median Population in 2001: {median_population}")
Median Population in 2001: 1829092.5
median_population = population_data['Population_2011'].median()
print(f"Median Population in 2011: {median_population}")
Median Population in 2011: 2008954.5
median_population = population_data['Population_2022'].median()
print(f"Median Population in 2022: {median_population}")
Median Population in 2022: 2215751.5
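# Note: mode() returns every most-frequent value in ascending order.
# If no value repeats, that is all of them, and [0] is simply the minimum.
# The results below match the "min" row of describe(), which suggests that
# is the case here, so the mode is not informative for these columns.
mode_population = population_data['Population_1991'].mode()[0]
print(f"Mode Population in 1991: {mode_population}")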
Mode Population in 1991: 246000
mode_population = population_data['Population_2001'].mode()[0]
print(f"Mode Population in 2001: {mode_population}")
Mode Population in 2001: 311741
mode_population = population_data['Population_2011'].mode()[0]
print(f"Mode Population in 2011: {mode_population}")
Mode Population in 2011: 404091
mode_population = population_data['Population_2022'].mode()[0]
print(f"Mode Population in 2022: {mode_population}")
Mode Population in 2022: 495252
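The per-year calculations above repeat the same two calls for each column. One possible way to collapse them is `DataFrame.agg`, sketched below. To keep the block runnable without the CSV, it uses only the six complete Barisal rows shown in the `head()` output, so its numbers differ from the 64-district results; the names `subset` and `central` are introduced here for illustration.

```python
import pandas as pd

year_cols = ["Population_1991", "Population_2001", "Population_2011", "Population_2022"]

# The six complete Barisal rows from the head() output above.
# With the real file loaded, use population_data in place of subset.
subset = pd.DataFrame(
    [
        [805000, 887376, 927889, 1035596],
        [2299000, 2465249, 2414729, 2634203],
        [1532000, 1781043, 1846351, 1980452],
        [694000, 726182, 709914, 677559],
        [1323000, 1527628, 1596223, 1770096],
        [1104000, 1161548, 1157215, 1227915],
    ],
    columns=year_cols,
    index=["Barguna", "Barishal", "Bhola", "Jhalokati", "Patuakhali", "Pirojpur"],
)

# One row per census year, one column per statistic.
central = subset[year_cols].agg(["mean", "median"]).T
# A positive gap means the mean sits above the median.
central["mean_minus_median"] = central["mean"] - central["median"]

print(central)
```

In the full dataset the mean is larger than the median in every census year (for example 2653576.765625 versus 2215751.5 in 2022), which is consistent with a right-skewed distribution in which a few very populous districts pull the mean upward.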
Step 5: Histogram View
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
years = ['Population_1991', 'Population_2001', 'Population_2011', 'Population_2022']
for ax, year in zip(axes.flatten(), years):
    sns.histplot(population_data[year], bins=15, kde=True, ax=ax)
    ax.set_title(f'Histogram of {year}')
    ax.set_xlabel('Population')
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
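To read the bar heights as numbers rather than from the plot, the bin counts can be computed directly. The sketch below uses NumPy's `np.histogram` with equal-width bins, which is what an integer `bins` value requests. The `values` array is hypothetical and deliberately right-skewed, and it uses 5 bins instead of 15 so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical right-skewed values, invented for illustration.
values = np.array([1, 2, 2, 3, 3, 3, 4, 5, 9, 20], dtype=float)

# Split the range [min, max] into 5 equal-width bins and count values in each.
counts, edges = np.histogram(values, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

Most values fall in the first bin while a single large value sits alone in the last one, which is the same long right tail the mean and median comparison points to.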
