Exploring Probability Distributions
Probability Distribution - Data Professionals in India and Salary Range
Exploration of Probability Distribution (Data Professionals)
Data professionals need to be able to determine which type of probability distribution best fits their data, calculate z-scores, and detect outliers. These skills show how the data is distributed and which data points need further examination.
In this case, we take a salary dataset from the market and examine the salaries of data professionals in India, specifically in Bangalore (often called the Silicon Valley of India).
Step 1 : Importing the required libraries, packages and modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
Step 2 : Reading the raw data and exploring the existing dataset :
# Data exploration phase:
# Reading the dataset (source: Kaggle):
dataprofession = pd.read_csv("/Users/maheshg/Library/CloudStorage/OneDrive-Microsoft365/Sample Datasets Kaggle/Partially Cleaned Salary Dataset.csv")
# Viewing the first 10 rows of dataprofession. No filter or sort is applied, so these are simply the first rows in file order
print(dataprofession.head(10))
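Before computing any statistics, it can help to check column types and missing values, since missing salaries can propagate into later calculations such as z-scores. The following is a minimal sketch under stated assumptions: the `sample` DataFrame and its "Job Title" column are hypothetical stand-ins for the Kaggle file, and only the "Salary" column name comes from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle file; only the "Salary" column
# name is taken from the post, everything else is made up for illustration.
rng = np.random.default_rng(seed=0)
sample = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Analyst", "Data Engineer"] * 4,
    "Salary": rng.integers(300_000, 2_500_000, size=12).astype(float),
})
sample.loc[3, "Salary"] = np.nan  # simulate one missing salary

# Column types and non-null counts
sample.info()

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count, mean, standard deviation, min, quartiles and max of the salary column
print(sample["Salary"].describe())

# Keep only the rows that have a salary before computing statistics
clean = sample.dropna(subset=["Salary"])
print(len(clean))
```

With the real dataset, the same calls would be applied to `dataprofession` instead of `sample`.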
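Step 3 : Computing the mean and a measure of dispersion (the standard deviation) of the data professionals' salaries. These values are descriptive statistics rather than statistical tests; they are used below to check the empirical rule and to compute z-scores
## Applying the empirical rule: check what share of salaries falls within 1 standard deviation of the mean (the same check can be repeated for 2 and 3 standard deviations)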
result_mean = np.mean(dataprofession["Salary"])
print(result_mean)
result_std = np.std(dataprofession["Salary"])
print(result_std)
# 1 standard deviation below the mean = Mean - (1*Standard Deviation)
single_below_standard_deviation = result_mean - (1*result_std)
print(single_below_standard_deviation)
# 1 standard deviation above the mean = Mean + (1*Standard Deviation)
single_above_standard_deviation = result_mean + (1*result_std)
print(single_above_standard_deviation)
# Upper and lower limits of the 1-standard-deviation band around the mean salary
upper_limit = result_mean + 1 * result_std
print(upper_limit)
lower_limit = result_mean - 1 * result_std
print(lower_limit)
# Share of salaries within 1 standard deviation of the mean (about 0.68 for normally distributed data)
print(((dataprofession['Salary'] >= lower_limit) &
       (dataprofession['Salary'] <= upper_limit)).mean())
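The same band check can be repeated for 2 and 3 standard deviations to compare against the empirical rule (roughly 68%, 95% and 99.7% for normally distributed data). The sketch below is an illustration only: the `salaries` series is synthetic and normally distributed by construction, and `share_within` is a hypothetical helper name, not part of any library. Real salary data is often skewed, so the shares for the actual dataset may differ noticeably.

```python
import numpy as np
import pandas as pd

# Synthetic, normally distributed salaries (an assumption for illustration;
# this is not the Kaggle data).
rng = np.random.default_rng(seed=42)
salaries = pd.Series(rng.normal(loc=1_000_000, scale=250_000, size=10_000),
                     name="Salary")

mean = salaries.mean()
std = salaries.std(ddof=0)  # population standard deviation, same default as np.std


def share_within(values, mean, std, k):
    """Return the fraction of values within k standard deviations of the mean."""
    lower = mean - k * std
    upper = mean + k * std
    return ((values >= lower) & (values <= upper)).mean()


# Fraction of salaries inside the 1, 2 and 3 standard deviation bands
shares = {k: share_within(salaries, mean, std, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.3f}")
```

If the shares computed on the real salary column are far from these reference values, that is a hint that a normal distribution may not be a good fit.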
# Computing the z-score of each salary:
dataprofession['Z_Score'] = stats.zscore(dataprofession['Salary'])
# print(dataprofession['Z_Score'])
# print(dataprofession.describe())
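# Rows whose z-score is above 3 or below -3 (potential outliers).
# Each comparison needs its own parentheses because | binds more tightly than > and <
print(dataprofession[(dataprofession['Z_Score'] > 3) | (dataprofession['Z_Score'] < -3)])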
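A z-score is the distance of a value from the mean, measured in standard deviations: z = (x - mean) / standard deviation. The sketch below computes it by hand and compares it with `stats.zscore`, which uses the population standard deviation (`ddof=0`) by default. The data is synthetic, with one extreme salary added on purpose, and the column name `Z_Manual` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic salaries plus one deliberately extreme value (illustration only;
# this is not the Kaggle data).
rng = np.random.default_rng(seed=1)
values = rng.normal(1_000_000, 200_000, size=200)
values = np.append(values, 9_000_000)
frame = pd.DataFrame({"Salary": values})

# Manual z-score: (x - mean) / population standard deviation
salary_mean = frame["Salary"].mean()
salary_std = frame["Salary"].std(ddof=0)
frame["Z_Manual"] = (frame["Salary"] - salary_mean) / salary_std

# Library z-score for comparison
frame["Z_Score"] = stats.zscore(frame["Salary"])

# Taking the absolute value catches both tails with a single comparison
outliers = frame[frame["Z_Score"].abs() > 3]
print(outliers)
```

Using `.abs() > 3` is equivalent to the two-sided condition above and avoids the operator precedence pitfall of combining `>` and `<` with `|`.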