Empirical Distribution with an example using Python

Electric Vehicle and Distributed Values - Sample and Population using Python

Mahesh Gurumoorthi
January 02, 2025

Electric Vehicle and Empirical Distribution Stats Model:

An empirical distribution is a distribution constructed from observed data. It represents the frequencies or probabilities with which data points occur in the sample, rather than relying on any theoretical model or distribution. It is a non-parametric way of summarizing data. In simple terms, an empirical distribution shows how the data you actually have is distributed, and it is often visualized using histograms or empirical cumulative distribution functions (ECDFs).
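As a minimal sketch of the idea, the ECDF at a value x is simply the fraction of observations that are less than or equal to x. The helper name `ecdf_at` and the sample values below are made up for illustration and are not part of the EV dataset.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
sample = np.array([3, 1, 4, 1, 5, 9, 2, 6])

def ecdf_at(data, x):
    """Return the fraction of observations in `data` that are <= x."""
    data = np.asarray(data)
    return np.count_nonzero(data <= x) / data.size

# Fraction of the sample at or below each query point.
p_low = ecdf_at(sample, 1)   # 2 of 8 values are <= 1
p_mid = ecdf_at(sample, 4)   # 5 of 8 values are <= 4
p_max = ecdf_at(sample, 9)   # every value is <= 9
print(p_low, p_mid, p_max)
```

No distributional assumption is needed here, which is what makes the approach non-parametric.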

About the dataset :

This dataset, sourced from the Washington State Department of Licensing (DOL), covers registrations of Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs). It reflects the growing trends and preferences in sustainable transportation within the state.

Step 1 : Importing Required Libraries and packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2 : Reading the dataset using Pandas:

ev_dataset = pd.read_csv('/Users/Sample Datasets Kaggle/Electric_Vehicle_Population_Data.csv')

Step 3 : Describing the dataset using pandas library:

ev_dataset.describe()

Step 4 : Reviewing the head of the EV dataset

ev_dataset.head()

Step 5 : Converting the column's datatype to float for easier cleaning and understanding

ev_dataset['2020 Census Tract'] = ev_dataset['2020 Census Tract'].astype(float)

Step 6 : Generate the population mean and standard deviation from the main EV dataset. Here I'm using the DOL Vehicle ID column, which is a unique ID for each vehicle in the dataset

# Generate the population mean
mu = np.mean(ev_dataset['DOL Vehicle ID'])
# Generate the standard deviation of the population
sd = np.std(ev_dataset['DOL Vehicle ID'])

Step 7 : Generating a large set of observations from a normal distribution with the population mean and standard deviation

observations_ev_vehicle = np.random.normal(mu,sd, size=100000)
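To connect this step to the sample-versus-population idea, the sketch below draws samples of increasing size from a normal distribution and compares each sample mean with the mean used to generate it. The values of `mu_demo` and `sd_demo` are placeholders assumed for illustration, not statistics from the EV dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is repeatable

# Placeholder parameters (assumed for illustration only).
mu_demo, sd_demo = 100.0, 15.0

# Draw samples of increasing size and record how far each sample mean
# lands from the generating mean.
errors = {}
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(mu_demo, sd_demo, size=n)
    errors[n] = abs(sample.mean() - mu_demo)
    print(n, round(sample.mean(), 3), round(sample.std(), 3))
```

Larger samples tend to give a mean and standard deviation closer to the generating values, although any single run can deviate.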

Step 8 : Converting the simulated EV observations to integers to avoid scientific (exponential) notation:

observations_ev_vehicle = observations_ev_vehicle.astype(int)
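One detail worth knowing about this conversion: casting floats to `int` drops the fractional part instead of rounding. The small sketch below, using made-up values, compares the cast with rounding first via `np.rint`. Whether rounding is preferable here is a judgment call, since the difference is negligible at the scale of the vehicle IDs.

```python
import numpy as np

# Made-up values, for illustration only.
values = np.array([2.7, -2.7, 1999.5001])

truncated = values.astype(int)          # drops the fractional part (toward zero)
rounded = np.rint(values).astype(int)   # rounds to the nearest integer first

print(truncated)
print(rounded)
```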

Step 9 : Use the seaborn package to plot the distribution of the observations. The vertical lines mark one standard deviation (green) and two standard deviations (yellow) on either side of the mean.

sns.displot(observations_ev_vehicle)

plt.axvline(np.mean(observations_ev_vehicle) + np.std(observations_ev_vehicle), color = "g")
plt.axvline(np.mean(observations_ev_vehicle) - np.std(observations_ev_vehicle), color = "g")

plt.axvline(np.mean(observations_ev_vehicle) + np.std(observations_ev_vehicle) * 2, color = "y")
plt.axvline(np.mean(observations_ev_vehicle) - np.std(observations_ev_vehicle) * 2, color = "y")

Step 10 : Taking another view of the dataset using the statsmodels library

from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

ecdf = ECDF(observations_ev_vehicle)

plt.plot(ecdf.x, ecdf.y)

plt.axhline(y = 0.025, color = 'y', linestyle='-')
plt.axvline(x = np.mean(observations_ev_vehicle) - (2 * np.std(observations_ev_vehicle)), color = 'y', linestyle='-')

plt.axhline(y = 0.975, color = 'y', linestyle='-')
plt.axvline(x = np.mean(observations_ev_vehicle)+ (2 * np.std(observations_ev_vehicle)), color= 'y', linestyle='-')

Conclusion : Based on the above empirical distribution, this graph visualizes the distribution of a large dataset, highlighting how data points are spread around the mean and the frequency of different values. The vertical lines help indicate statistical thresholds, such as standard deviations, which are useful for understanding the variability and significance of the data.

The two yellow lines mark two standard deviations on either side of the mean, and the two green lines mark one standard deviation. For normally distributed values, about 68% of the observations fall between the green lines and about 95% fall between the yellow lines, which is why the yellow lines meet the ECDF near the 0.025 and 0.975 levels. These percentages describe how the simulated values spread around the mean, not the share of people who use EVs.
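The share of values that falls inside each pair of lines can also be checked numerically. The sketch below uses a standard normal sample as a stand-in for `observations_ev_vehicle` (an assumption, so that it runs without the CSV file) and counts the fraction of values within one and two standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Stand-in data (assumed); the post derives mu and sd from the dataset instead.
simulated = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean, sd = simulated.mean(), simulated.std()

# Share of values within one and two standard deviations of the mean.
within_1sd = np.mean(np.abs(simulated - mean) <= sd)
within_2sd = np.mean(np.abs(simulated - mean) <= 2 * sd)

print(f"within 1 sd: {within_1sd:.3f}")
print(f"within 2 sd: {within_2sd:.3f}")
```

For normally distributed values these shares should come out close to 0.68 and 0.95.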