Empirical Distribution with an example using Python
Electric Vehicle and Distributed Values - Sample and Population using Python
Electric Vehicle and Empirical Distribution Stats Model:
An empirical distribution is a distribution constructed directly from observed data. It represents the frequencies or proportions with which values occur in the sample, rather than relying on a theoretical model, which makes it a non-parametric way of summarizing data. In simple terms, an empirical distribution shows how the data you actually have is distributed. It is often visualized with a histogram or an empirical cumulative distribution function (ECDF).
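To make the definition concrete, here is a minimal sketch of an ECDF computed by hand with NumPy. The helper name `ecdf` and the toy sample are invented for illustration and are not part of the EV dataset.

```python
import numpy as np

def ecdf(sample):
    """Return the sorted values and the ECDF step heights (i + 1) / n."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Toy sample, invented for illustration only.
toy_sample = [3, 1, 4, 1, 5]
x, y = ecdf(toy_sample)
print(x)  # sorted observations
print(y)  # cumulative proportions, ending at 1.0
```

Each height in `y` is the proportion of observations at or below the matching position in the sorted sample, which is exactly what an ECDF plot draws as a staircase.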
About the dataset :
This dataset records registrations of Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs). It is sourced from the Washington State Department of Licensing (DOL) and reflects the trends and preferences in electric vehicle adoption within the state.
Step 1 : Importing Required Libraries and packages
Step 2 : Reading the dataset using Pandas:
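The original code for this step is not shown, so the following is only a sketch of how the file might be read. The three rows are invented, the column names are an assumption about the DOL export, and the filename in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for the real file: three invented rows with a few of the
# columns the DOL export is assumed to contain.
sample_csv = io.StringIO(
    "Make,Model,Model Year,Electric Vehicle Type,DOL Vehicle ID\n"
    "TESLA,MODEL 3,2020,Battery Electric Vehicle (BEV),100000001\n"
    "NISSAN,LEAF,2018,Battery Electric Vehicle (BEV),100000002\n"
    "TOYOTA,PRIUS PRIME,2021,Plug-in Hybrid Electric Vehicle (PHEV),100000003\n"
)
ev = pd.read_csv(sample_csv)

# With the real download this would look something like:
# ev = pd.read_csv("ev_population.csv")  # hypothetical filename
print(ev.shape)
```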
Step 3 : Describing the dataset using pandas library:
Step 4 : Reviewing the head of the EV dataset
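Steps 3 and 4 can be sketched together. The small DataFrame below is made up so the example runs on its own; with the real data, `ev` would be the frame read from the DOL file.

```python
import pandas as pd

# Invented rows standing in for the EV dataset.
ev = pd.DataFrame(
    {
        "Make": ["NISSAN", "TESLA", "TOYOTA"],
        "Model Year": [2018, 2020, 2021],
        "DOL Vehicle ID": [100000001, 100000002, 100000003],
    }
)

# describe() summarizes numeric columns by default (count, mean, std, quartiles).
summary = ev.describe()
print(summary)

# head(n) returns the first n rows (5 if n is omitted).
first_rows = ev.head(2)
print(first_rows)
```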
Step 5 : Converting the column datatype to float for easier cleaning and analysis
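The post does not show which column is converted. The sketch below assumes it is the `DOL Vehicle ID` column used in the next step, and the values are invented.

```python
import pandas as pd

# Invented IDs standing in for the real column.
ev = pd.DataFrame({"DOL Vehicle ID": [100000001, 100000002, 100000003]})

print(ev["DOL Vehicle ID"].dtype)  # integer dtype before conversion

# astype(float) returns a new Series, so assign it back to the column.
ev["DOL Vehicle ID"] = ev["DOL Vehicle ID"].astype(float)
print(ev["DOL Vehicle ID"].dtype)
```

If the column might contain non-numeric entries, `pd.to_numeric(ev["DOL Vehicle ID"], errors="coerce")` is an alternative that turns them into `NaN` instead of raising an error.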
Step 6 : Compute the population mean and standard deviation from the main EV dataset, using the DOL Vehicle ID column, which is the unique ID in the dataset
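A minimal sketch of this step, using small made-up numbers so the arithmetic is easy to check by hand. The variable names `pop_mean` and `pop_std` are illustrative and not taken from the original code.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the float-converted ID column.
vehicle_ids = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
values = vehicle_ids.to_numpy()

pop_mean = np.mean(values)
pop_std = np.std(values)        # NumPy default ddof=0: population formula
sample_std = vehicle_ids.std()  # pandas default ddof=1: sample formula

print(pop_mean, pop_std, sample_std)
```

The two libraries use different defaults, so `np.std` and `Series.std` give slightly different answers on the same data. For a population standard deviation, `ddof=0` is the one to use.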
Step 7 : Generating a large set of observations from the overall population
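How the observations were generated is not shown in the post. One common approach, assumed here, is to draw values from a normal distribution with the population mean and standard deviation. The mean, standard deviation, sample size, and seed below are placeholders, not figures from the dataset.

```python
import numpy as np

# Placeholder parameters; the real ones would come from the previous step.
pop_mean = 200_000_000.0
pop_std = 90_000_000.0

# A seeded generator makes the draw reproducible.
rng = np.random.default_rng(42)
observations_ev_vehicle = rng.normal(loc=pop_mean, scale=pop_std, size=10_000)

print(observations_ev_vehicle[:3])
```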
Step 8 : Converting the observations_ev_vehicle values to integers to avoid scientific (exponential) notation:
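A short sketch of the conversion, using two invented float values. Note that `astype` truncates toward zero, while `np.rint` rounds to the nearest integer first, so the two can differ by one.

```python
import numpy as np

# Invented float observations that NumPy would print in exponential form.
observations = np.array([123456789.6, 2.5e8])
print(observations)

# Truncating conversion: drops the fractional part.
truncated = observations.astype(np.int64)

# Rounding conversion: rounds to nearest, then converts.
rounded = np.rint(observations).astype(np.int64)

print(truncated, rounded)
```

Using `np.int64` explicitly avoids depending on the platform's default integer size.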
Step 9 : Use the seaborn package to plot the distribution of the observations, with lines marking one and two standard deviations either side of the mean.
mean, std = np.mean(observations_ev_vehicle), np.std(observations_ev_vehicle)
sns.displot(observations_ev_vehicle)
# Green lines: one standard deviation either side of the mean
plt.axvline(mean + std, color="g")
plt.axvline(mean - std, color="g")
# Yellow lines: two standard deviations either side of the mean
plt.axvline(mean + 2 * std, color="y")
plt.axvline(mean - 2 * std, color="y")
plt.show()
Step 10 : Taking another view of the dataset using the statsmodels library
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt
mean, std = np.mean(observations_ev_vehicle), np.std(observations_ev_vehicle)
ecdf = ECDF(observations_ev_vehicle)
plt.plot(ecdf.x, ecdf.y)
# Horizontal lines: cumulative proportions 0.025 and 0.975
plt.axhline(y=0.025, color="y", linestyle="-")
plt.axhline(y=0.975, color="y", linestyle="-")
# Vertical lines: two standard deviations either side of the mean
plt.axvline(x=mean - 2 * std, color="y", linestyle="-")
plt.axvline(x=mean + 2 * std, color="y", linestyle="-")
plt.show()
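The horizontal lines at 0.025 and 0.975 and the vertical lines at two standard deviations nearly coincide on the ECDF, but not exactly: for normal data the 2.5% and 97.5% points sit at about 1.96 standard deviations from the mean. The sketch below checks this numerically with `np.quantile`, assuming normally distributed observations with placeholder parameters.

```python
import numpy as np

# Placeholder normal sample; parameters and seed are illustrative only.
rng = np.random.default_rng(0)
obs = rng.normal(loc=200_000_000.0, scale=90_000_000.0, size=100_000)

mean, std = obs.mean(), obs.std()

# Values below which 2.5% and 97.5% of the observations fall.
q_low, q_high = np.quantile(obs, [0.025, 0.975])

# Express the quantiles as distances from the mean in standard deviations.
z_low = (q_low - mean) / std
z_high = (q_high - mean) / std
print(z_low, z_high)  # expected to be close to -1.96 and +1.96
```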
Conclusion : The plots above show the empirical distribution of a large set of observations, highlighting how the values are spread around the mean and how often different values occur. The vertical lines mark statistical thresholds, here multiples of the standard deviation, which help in reading the variability of the data.
The two green lines mark one standard deviation either side of the mean, and the two yellow lines mark two standard deviations. These are standard-deviation bands rather than confidence intervals. If the observations are approximately normal, about 68% of them fall between the green lines and about 95% between the yellow lines. The 68% figure describes the spread of the plotted values around the mean; on its own it does not measure how many people use EVs.
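The 68% and 95% figures can be checked directly by counting observations inside each band. This is a sketch on simulated normal data with placeholder parameters, not on the actual EV observations.

```python
import numpy as np

# Placeholder normal sample; parameters and seed are illustrative only.
rng = np.random.default_rng(7)
obs = rng.normal(loc=200_000_000.0, scale=90_000_000.0, size=100_000)

mean, std = obs.mean(), obs.std()

# Proportion of observations within one and two standard deviations of the mean.
within_one = np.mean(np.abs(obs - mean) <= std)
within_two = np.mean(np.abs(obs - mean) <= 2 * std)

print(f"within 1 std: {within_one:.3f}")
print(f"within 2 std: {within_two:.3f}")
```

If the real observations gave proportions far from these, that would suggest the data is not well described by a normal distribution.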