Exploratory Data Analysis - IMDb Dataset - Votes vs Score

A step-by-step statistical analysis in Python

Problem Statement and Data Set:

This analysis examines IMDb scores using the IMDb vote counts provided in a Kaggle dataset. An IMDb score is the overall user rating of a title, typically ranging from 1 to 10 and aggregated from individual user votes. These votes capture audience reactions, preferences, and perceptions of a film’s quality.

The objective of this analysis is to examine the relationship between IMDb votes and the assigned IMDb scores, identifying key patterns, trends, and influencing factors that contribute to a movie’s final rating. This includes:

Evaluating the distribution of votes across different movies and genres.

Assessing how the number of votes affects the IMDb score, in particular whether higher vote counts lead to more stable ratings.

Identifying anomalies, such as movies with exceptionally high or low scores relative to their vote distribution.

Understanding possible biases in user ratings based on factors like genre, popularity, or external reviews.

Exploring statistical techniques, including correlation analysis and predictive modeling, to estimate IMDb scores from vote-related metrics.

This study should show how IMDb ratings are shaped by user votes. The findings may also help predict ratings from voting patterns, which could improve future recommendations and audience engagement strategies.
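One way to probe the "more stable ratings" question is to group titles by vote count and compare how widely the scores spread inside each group. The sketch below uses a small hypothetical table (the values are made up purely for illustration, not taken from the Kaggle file) and assumes the same column names, imdb_votes and imdb_score, that appear later in this post.

```python
import pandas as pd

# Hypothetical toy data for illustration only; not from the real dataset.
toy = pd.DataFrame({
    "imdb_votes": [12, 40, 55, 300, 800, 1500, 20000, 45000, 90000],
    "imdb_score": [2.1, 9.4, 5.0, 6.8, 4.9, 7.7, 7.1, 6.6, 7.4],
})

# Split titles into three equally sized groups by vote count.
toy["vote_bucket"] = pd.qcut(toy["imdb_votes"], q=3, labels=["low", "mid", "high"])

# Spread of scores inside each bucket; a smaller std suggests more stable ratings.
spread = (
    toy.groupby("vote_bucket", observed=True)["imdb_score"]
    .agg(["mean", "std", "count"])
)
print(spread)
```

On real data the number of buckets is a judgment call; three is only an assumption made here to keep the output short.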

Step 1 : Importing the required libraries and packages

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

print("All Set!")

Step 2 : Choosing and loading the dataset 😀

amazon_tvshows = pd.read_csv('/Users/maheshg/Amazon Prime TV Shows/titles.csv')
amazon_tvshows.head()

Step 3 : Replacing the NaN values and separating the numerical and categorical variables

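numerical = []
categorical = []

# Assign every column to exactly one list based on its dtype.
# The else must belong to the if (not to the for loop), otherwise
# only the last column ends up in the categorical list.
for col in amazon_tvshows.columns:
    if amazon_tvshows[col].dtype == 'float64' or amazon_tvshows[col].dtype == 'int64':
        numerical.append(col)
    else:
        categorical.append(col)

print("Numerical columns:", numerical)
print("Categorical columns:", categorical)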
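The same split can also be done without an explicit loop. The sketch below uses pandas' select_dtypes on a hypothetical two-row table; the column names mirror ones likely to exist in the titles file, but that is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the titles table (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "B"],
    "release_year": [2001, 2015],
    "imdb_score": [7.2, None],
    "genres": ["['drama']", "['comedy']"],
})

# Numeric columns (ints and floats) versus everything else.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Unlike a check against 'float64' and 'int64' only, include="number" also picks up other numeric widths such as int32 or float32.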

Step 4 : Checking for null values in the dataset

print("Number of missing values in each numerical column \n")
for c in numerical:
    print(c,":",amazon_tvshows[c].isnull().sum())
print("Number of missing values in each categorical column \n")
for c in categorical:
    print(c,":",amazon_tvshows[c].isnull().sum())
amazon_tvshows.describe()
type(amazon_tvshows)
amazon_tvshows.isna().sum()
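Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a hypothetical table, combining both into one summary:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "imdb_score": [7.2, None, 6.1, None],
    "imdb_votes": [1200.0, 85.0, None, 430.0],
    "title": ["A", "B", "C", "D"],
})

# isna().mean() gives the fraction of missing rows per column.
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing)
```

The same expression should work on amazon_tvshows in place of toy.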

Note : Different ways to replace NA values

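# Fill missing values with the column mean. Assigning the result back is safer
# than calling fillna(..., inplace=True) on a selected column, which recent
# pandas versions warn about.
amazon_tvshows['imdb_score'] = amazon_tvshows['imdb_score'].fillna(amazon_tvshows['imdb_score'].mean())
amazon_tvshows['imdb_votes'] = amazon_tvshows['imdb_votes'].fillna(amazon_tvshows['imdb_votes'].mean())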
amazon_tvshows['imdb_score'].isna().sum()
amazon_tvshows['imdb_votes'].isna().sum()
amazon_tvshows.describe()
amazon_tvshows['imdb_votes'].describe()
amazon_tvshows['imdb_score'].describe()
# Drop every row that still has a missing value in any column
amazon_tvshows = amazon_tvshows.dropna()
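The code above fills with the mean and then drops any row with a remaining gap. Two other common options are sketched below on a hypothetical table: filling with the median, which is less sensitive to a few very large vote counts, and dropping rows only when the columns under study are missing. The age_certification column is an assumed example of an unrelated column with gaps.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "imdb_score": [7.0, None, 6.0, 8.0],
    "imdb_votes": [100.0, 200.0, None, 1_000_000.0],
    "age_certification": ["TV-14", None, "TV-PG", None],
})

# Option 1: fill with the median instead of the mean.
median_filled = toy.copy()
median_filled["imdb_votes"] = median_filled["imdb_votes"].fillna(
    median_filled["imdb_votes"].median()
)

# Option 2: drop rows only when score or votes are missing,
# so gaps in unrelated columns do not cost extra rows.
subset_dropped = toy.dropna(subset=["imdb_score", "imdb_votes"])

print(median_filled)
print(len(toy.dropna()), "rows after a plain dropna()")
print(len(subset_dropped), "rows after dropna(subset=...)")
```

Which option fits best depends on how many rows are missing and why, so this is a starting point rather than a recommendation.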

Step 6 : Sort the values by IMDB Score
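No code accompanies this step in the original, so the following is only a sketch of what the sort could look like, shown on a hypothetical table. It sorts by score and uses the vote count to break ties, which is an assumption about the intended ordering.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.2, 8.9, 8.9, 5.4],
    "imdb_votes": [1200, 300, 45000, 80],
})

# Highest score first; among equal scores, the title with more votes first.
top = (
    toy.sort_values(["imdb_score", "imdb_votes"], ascending=[False, False])
    .reset_index(drop=True)
)
print(top)
```

Applied to the real data, the equivalent call would be amazon_tvshows.sort_values(['imdb_score', 'imdb_votes'], ascending=[False, False]).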

Step 7 : Assessing IMDb votes as a popularity index for titles

amazon_tvshows.plot.scatter(x='imdb_score', y='imdb_votes', figsize=(10, 6), color='blue', alpha=0.5)

Step 8 : Correlation and heat map for the above dataset

pop_ind_corr = amazon_tvshows['imdb_score'].corr(amazon_tvshows['imdb_votes'])
print("Correlation between imdb_score and imdb_votes: ", pop_ind_corr)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='imdb_score', y='imdb_votes', data=amazon_tvshows, alpha=0.5)
plt.title('IMDB Score vs IMDB Votes')
plt.xlabel('IMDB Score')
plt.ylabel('IMDB Votes')
plt.show()
# Separate the candidate features (X) from the target score (y);
# the actual train/test split happens in Step 10
X = amazon_tvshows.drop(['imdb_score', 'imdb_votes'], axis=1)
y = amazon_tvshows['imdb_score']
plt.figure(figsize=(12, 8))
corr_matrix = amazon_tvshows[numerical].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
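The default corr() call computes a Pearson correlation, which a handful of titles with very large vote counts can dominate. Two simple checks, sketched here on a hypothetical table, are a rank-based (Spearman) correlation, computed as the Pearson correlation of the ranks, and a Pearson correlation against log-transformed votes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only, with one extreme vote count.
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0],
    "imdb_votes": [10, 30, 50, 200, 900, 5000, 40000, 2_000_000],
})

# Plain Pearson correlation, as in the step above.
pearson = toy["imdb_score"].corr(toy["imdb_votes"])

# Spearman correlation: Pearson correlation of the ranks.
spearman = toy["imdb_score"].rank().corr(toy["imdb_votes"].rank())

# Pearson correlation after compressing the vote scale with log(1 + votes).
log_pearson = toy["imdb_score"].corr(np.log1p(toy["imdb_votes"]))

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
print("Pearson on log votes:", round(log_pearson, 3))
```

If the three numbers differ a lot on the real data, that would suggest the relationship is monotonic but not linear in raw votes.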

Step 9 : Distribution of Popularity Index

plt.figure(figsize=(10, 6))
sns.histplot(amazon_tvshows['tmdb_popularity'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of TMDB Popularity')
plt.xlabel('TMDB Popularity')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(10, 6))
sns.histplot(amazon_tvshows['release_year'], bins=30, kde=False, color='orange')
plt.title('Distribution of TV Shows by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()
from sklearn.cluster import KMeans

# Reshape release_year for clustering
release_year_values = amazon_tvshows['release_year'].values.reshape(-1, 1)

# Fit KMeans with 3 clusters (you can change n_clusters as needed)
kmeans = KMeans(n_clusters=3, random_state=42)
amazon_tvshows['release_year_cluster'] = kmeans.fit_predict(release_year_values)

# Show cluster centers
print("Cluster centers (release years):", kmeans.cluster_centers_.flatten())
amazon_tvshows[['release_year', 'release_year_cluster']].head()

Step 9 : Elbow method for choosing k in K-means clustering


inertia = []
K = range(1, 11)
for k in K:
    kmeans_model = KMeans(n_clusters=k, random_state=42)
    kmeans_model.fit(release_year_values)
    inertia.append(kmeans_model.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k (release_year)')
plt.show()
from sklearn.linear_model import LinearRegression

# Prepare the data
X_votes = amazon_tvshows[['imdb_votes']]
y_scores = amazon_tvshows['imdb_score']

# Fit the linear regression model
linreg = LinearRegression()
linreg.fit(X_votes, y_scores)

# Print the coefficients
print("Intercept:", linreg.intercept_)
print("Slope:", linreg.coef_[0])
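The intercept and slope describe the fitted line but not how well it explains the scores. A minimal sketch of checking the fit with R², using synthetic data generated only for this example (the relationship and noise level are assumptions, not findings):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration only: a weak linear trend plus noise.
rng = np.random.default_rng(0)
votes = rng.integers(10, 100_000, size=200).astype(float).reshape(-1, 1)
scores = 6.0 + 0.00001 * votes.ravel() + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(votes, scores)

# score() returns R^2: the share of score variance explained by the line.
r2 = model.score(votes, scores)
print("R^2:", round(r2, 3))

# The same value via the metrics helper, from explicit predictions.
r2_check = r2_score(scores, model.predict(votes))
```

On the real data the equivalent call would be linreg.score(X_votes, y_scores). A value near zero would mean votes alone say little about the score.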

Step 10 : Evaluating accuracy and classification metrics

from sklearn.metrics import classification_report, accuracy_score

# Predictive modeling using RandomForestClassifier to predict 'release_year_cluster'

# Prepare features and target
features = X.select_dtypes(include=[np.number]).drop(columns=['release_year_cluster'], errors='ignore')
target = amazon_tvshows['release_year_cluster']

# Split data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
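One caveat when reading the accuracy: the target release_year_cluster was built from release_year alone, and release_year is still among the numeric features. The sketch below, on hypothetical years, illustrates why that matters: one-dimensional K-means clusters are non-overlapping ranges of years, so a model that sees the year can recover the cluster almost directly.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical release years for illustration only.
toy = pd.DataFrame(
    {"release_year": [1950, 1955, 1961, 1988, 1992, 1995, 2015, 2018, 2020, 2021]}
)

# Same clustering idea as in Step 9; n_init is set explicitly here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["release_year_cluster"] = kmeans.fit_predict(toy[["release_year"]])

# Year range covered by each cluster, ordered by starting year.
ranges = (
    toy.groupby("release_year_cluster")["release_year"]
    .agg(["min", "max"])
    .sort_values("min")
)
print(ranges)
```

If the goal is to see what the other features reveal about the release period, one option is to drop release_year from features before training; that is a suggestion, not part of the original analysis.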