Exploratory Data Analysis -IMDB Dataset - Votes vs Score
Statistical analysis using Python, with a detailed step-by-step walkthrough

Problem Statement and Data Set:
The problem statement focuses on analyzing IMDb scores using the IMDb vote counts recorded in a Kaggle dataset of Amazon Prime titles. IMDb scores represent the overall user rating of a title, typically ranging from 1 to 10, based on individual user votes. These votes capture audience reactions, preferences, and perceptions of a title's quality.
The objective of this analysis is to examine the relationship between IMDb votes and the assigned IMDb scores, identifying key patterns, trends, and influencing factors that contribute to a movie’s final rating. This includes:
Evaluating the distribution of votes across different movies and genres.
Assessing how the number of votes impacts the IMDb score, in particular whether higher vote counts lead to more stable ratings.
Identifying anomalies, such as movies with exceptionally high or low scores relative to their vote distribution.
Understanding possible biases in user ratings based on factors like genre, popularity, or external reviews.
Exploring statistical techniques, including correlation analysis and predictive modeling, to estimate IMDb scores from vote-related metrics.
By conducting this study, valuable insights can be gained into how IMDb ratings are shaped by user votes. The findings may also help predict ratings from voting patterns, improving future recommendations and audience engagement strategies.
Step 1 : Importing Required Libraries and Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
print("All Set!")
Step 2 : Loading and previewing the dataset 😀
amazon_tvshows = pd.read_csv('/Users/maheshg/Amazon Prime TV Shows/titles.csv')
amazon_tvshows.head()
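Before any cleaning, it is worth confirming the column names, dtypes, and row count. A quick check (the exact columns depend on the version of the Kaggle titles.csv in use):

# Column dtypes, non-null counts, and overall shape of the frame
amazon_tvshows.info()
print("Rows, columns:", amazon_tvshows.shape)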

Step 3 : Replacing the NaN values and separating the numerical and categorical variables
numerical = []
categorical = []
for col in amazon_tvshows.columns:
    if amazon_tvshows[col].dtype == 'float64' or amazon_tvshows[col].dtype == 'int64':
        numerical.append(col)
    else:
        categorical.append(col)
numerical

categorical

Step 4 : Checking for null values in the dataset:
print("Number of missing values in each numerical column \n")
for c in numerical:
print(c,":",amazon_tvshows[c].isnull().sum())

print("Number of missing values in each categorical column \n")
for c in categorical:
print(c,":",amazon_tvshows[c].isnull().sum())

amazon_tvshows.describe()

type(amazon_tvshows)

amazon_tvshows.isna().sum()

Note : Different ways to replace NA values
# Assign the filled column back instead of calling fillna(..., inplace=True) on a
# column selection, which triggers chained-assignment warnings in recent pandas
amazon_tvshows['imdb_score'] = amazon_tvshows['imdb_score'].fillna(amazon_tvshows['imdb_score'].mean())
amazon_tvshows['imdb_votes'] = amazon_tvshows['imdb_votes'].fillna(amazon_tvshows['imdb_votes'].mean())
amazon_tvshows['imdb_score'].isna().sum()
amazon_tvshows['imdb_votes'].isna().sum()
amazon_tvshows.describe()

amazon_tvshows['imdb_votes'].describe()

amazon_tvshows['imdb_score'].describe()
# Drop any rows that still contain missing values in the remaining columns
amazon_tvshows = amazon_tvshows.dropna()
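The note above promises several ways to replace NA values; besides the mean imputation used here, median and mode fills are common alternatives. A minimal sketch on a copy of the frame, so the working data is untouched (the choice of categorical column is illustrative):

# Alternatives to mean imputation, applied to a copy of the data
df = amazon_tvshows.copy()
# Median fill is more robust to heavily skewed vote counts than the mean
df['imdb_votes'] = df['imdb_votes'].fillna(df['imdb_votes'].median())
# Mode (most frequent value) fill for a categorical column
if categorical:
    col = categorical[0]
    df[col] = df[col].fillna(df[col].mode().iloc[0])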
Step 5 : Sorting the values by IMDb Score
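A minimal sketch of one way to do this, assuming the dataset includes a title column:

# Sort titles from highest to lowest IMDb score and preview the top rows
# ('title' is assumed to be the name column in titles.csv)
top_rated = amazon_tvshows.sort_values(by='imdb_score', ascending=False)
top_rated[['title', 'imdb_score', 'imdb_votes']].head(10)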

Step 6 : Assessing the relationship between IMDb votes and IMDb scores
amazon_tvshows.plot.scatter(x='imdb_score', y='imdb_votes', figsize=(10, 6), color='blue', alpha=0.5)

Step 7 : Correlation analysis and heat map for the dataset
pop_ind_corr = amazon_tvshows['imdb_score'].corr(amazon_tvshows['imdb_votes'])
print("Correlation between imdb_score and imdb_votes: ", pop_ind_corr)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='imdb_score', y='imdb_votes', data=amazon_tvshows, alpha=0.5)
plt.title('IMDB Score vs IMDB Votes')
plt.xlabel('IMDB Score')
plt.ylabel('IMDB Votes')
plt.show()
# Separate features and target for later modeling (the train/test split happens in Step 10)
X = amazon_tvshows.drop(['imdb_score', 'imdb_votes'], axis=1)
y = amazon_tvshows['imdb_score']

plt.figure(figsize=(12, 8))
corr_matrix = amazon_tvshows[numerical].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

Step 8 : Distribution of the Popularity Index
plt.figure(figsize=(10, 6))
sns.histplot(amazon_tvshows['tmdb_popularity'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of TMDB Popularity')
plt.xlabel('TMDB Popularity')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(10, 6))
sns.histplot(amazon_tvshows['release_year'], bins=30, kde=False, color='orange')
plt.title('Distribution of TV Shows by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

from sklearn.cluster import KMeans
# Reshape release_year for clustering
release_year_values = amazon_tvshows['release_year'].values.reshape(-1, 1)
# Fit KMeans with 3 clusters (you can change n_clusters as needed)
kmeans = KMeans(n_clusters=3, random_state=42)
amazon_tvshows['release_year_cluster'] = kmeans.fit_predict(release_year_values)
# Show cluster centers
print("Cluster centers (release years):", kmeans.cluster_centers_.flatten())
amazon_tvshows[['release_year', 'release_year_cluster']].head()
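As a sanity check on the grouping, the cluster sizes and the range of years each cluster covers can be inspected (a small sketch reusing the columns created above):

# Titles per cluster and the year span each cluster covers
print(amazon_tvshows['release_year_cluster'].value_counts())
print(amazon_tvshows.groupby('release_year_cluster')['release_year'].agg(['min', 'max']))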

Step 9 : Elbow method for choosing k with K-means clustering
inertia = []
K = range(1, 11)
for k in K:
    kmeans_model = KMeans(n_clusters=k, random_state=42)
    kmeans_model.fit(release_year_values)
    inertia.append(kmeans_model.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k (release_year)')
plt.show()
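The elbow can be ambiguous on one-dimensional data; a silhouette-score sweep is a common cross-check (a minimal sketch, reusing release_year_values from above; the silhouette score is only defined for k >= 2):

from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(release_year_values)
    print(f"k={k}: silhouette score = {silhouette_score(release_year_values, labels):.3f}")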

Fitting a simple linear regression of IMDb score on IMDb votes:
from sklearn.linear_model import LinearRegression
# Prepare the data
X_votes = amazon_tvshows[['imdb_votes']]
y_scores = amazon_tvshows['imdb_score']
# Fit the linear regression model
linreg = LinearRegression()
linreg.fit(X_votes, y_scores)
# Print the coefficients
print("Intercept:", linreg.intercept_)
print("Slope:", linreg.coef_[0])

Step 10 : Evaluating model accuracy and classification metrics
from sklearn.metrics import classification_report, accuracy_score
# Predictive modeling with RandomForestClassifier to predict 'release_year_cluster'
# Prepare features and target. Note that 'release_year' is still among the features
# and the clusters were derived from it, so near-perfect accuracy is expected;
# drop 'release_year' as well for a more meaningful benchmark.
features = X.select_dtypes(include=[np.number]).drop(columns=['release_year_cluster'], errors='ignore')
target = amazon_tvshows['release_year_cluster']
# Split data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Train RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
# Predict and evaluate
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
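Beyond aggregate accuracy, a confusion matrix shows which clusters get confused with each other (a small sketch reusing y_test and y_pred from above):

from sklearn.metrics import confusion_matrix

# Rows are true cluster labels, columns are predicted cluster labels
print(confusion_matrix(y_test, y_pred))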
