Descriptive Statistics IMDB Dataset
Descriptive statistics using R programming with detailed information

Problem Statement and Data Set:
This analysis examines IMDb scores alongside IMDb vote counts, using the titles.csv file from an Amazon Prime TV shows dataset on Kaggle. An IMDb score is the overall user rating of a title (a movie or a show), typically ranging from 1 to 10 and based on individual user votes. These votes capture audience reactions, preferences, and perceptions of a title's quality.
The objective is to examine the relationship between the number of IMDb votes and the assigned IMDb score, and to identify patterns, trends, and factors that contribute to a title's final rating. This includes:
- Evaluating the distribution of votes across different titles and genres.
- Assessing how the number of votes relates to the IMDb score, in particular whether higher vote counts go with more stable ratings.
- Identifying anomalies, such as titles with exceptionally high or low scores relative to their vote counts.
- Understanding possible biases in user ratings based on factors like genre, popularity, or external reviews.
- Exploring statistical techniques, including correlation analysis and predictive modeling, to estimate IMDb scores from vote-related metrics.
Such a study can show how IMDb ratings relate to user votes. The findings may also help predict ratings from voting patterns, which could support recommendations and audience engagement strategies.
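Step 1: Loading the required packages and the data
library(pastecs) # provides stat.desc(), used further below
dataset <- read.csv('/Users/maheshg/Dropbox/Sample Datasets Kaggle/Amazon Prime TV Shows/titles.csv')
clean_dataset <- na.omit(dataset) # drop every row that has an NA in any column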
# print(clean_dataset)
colnames(clean_dataset)
## [1] "id" "title" "type"
## [4] "description" "release_year" "age_certification"
## [7] "runtime" "genres" "production_countries"
## [10] "seasons" "imdb_id" "imdb_score"
## [13] "imdb_votes" "tmdb_popularity" "tmdb_score"
str(clean_dataset)
## 'data.frame': 897 obs. of 15 variables:
## $ id : chr "ts20945" "ts55748" "ts20005" "ts42867" ...
## $ title : chr "The Three Stooges" "What's My Line?" "I Love Lucy" "Mister Rogers' Neighborhood" ...
## $ type : chr "SHOW" "SHOW" "SHOW" "SHOW" ...
## $ description : chr "The Three Stooges were an American vaudeville and comedy team active from 1922 until 1970, best known for their"| __truncated__ "Four panelists must determine guests' occupations - and, in the case of famous guests, while blindfolded, their"| __truncated__ "Cuban Bandleader Ricky Ricardo would be happy if his wife Lucy would just be a housewife. Instead she tries con"| __truncated__ "Mister Rogers' Neighborhood is an American children's television series that was created and hosted by namesake"| __truncated__ ...
## $ release_year : int 1934 1950 1951 1968 1971 1966 1974 1972 1975 1968 ...
## $ age_certification : chr "TV-PG" "" "TV-G" "TV-Y" ...
## $ runtime : int 19 30 30 29 23 24 57 30 25 54 ...
## $ genres : chr "['comedy', 'family', 'animation', 'action', 'fantasy', 'horror']" "['reality', 'family']" "['comedy', 'family']" "['fantasy', 'music', 'family']" ...
## $ production_countries: chr "['US']" "['US']" "['US']" "['US']" ...
## $ seasons : num 26 18 9 31 6 12 49 6 11 22 ...
## $ imdb_id : chr "tt0850645" "tt1036980" "tt0043208" "tt0062588" ...
## $ imdb_score : num 8.6 8.6 8.5 8.7 7.9 8.1 8.7 7.9 7.5 8.3 ...
## $ imdb_votes : num 1092 1563 25944 8675 2116 ...
## $ tmdb_popularity : num 15.42 87.39 17.09 8.75 45.83 ...
## $ tmdb_score : num 7.6 6.9 8.1 4.7 8 6.7 7.1 7.5 7.3 8.5 ...
## - attr(*, "na.action")= 'omit' Named int [1:8974] 2 3 4 5 6 7 8 9 10 11 ...
## ..- attr(*, "names")= chr [1:8974] "2" "3" "4" "5" ...
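The `na.action` attribute in the output above shows that `na.omit()` removed 8974 rows and kept 897, because it drops a row when any column is `NA`. If only the score and vote columns matter for the analysis, a gentler filter may be worth considering. The following is a minimal sketch on a made-up toy data frame (the values and the `analysis_cols` name are illustrative assumptions, not taken from titles.csv):

```r
# Toy data standing in for titles.csv; all values here are invented.
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  seasons    = c(2, NA, NA, 5),
  imdb_score = c(7.1, 6.4, NA, 8.0),
  imdb_votes = c(1200, 85, 40, NA)
)

# na.omit() drops a row if ANY column contains NA.
all_cols <- na.omit(toy)

# Alternative: require completeness only in the columns being analyzed.
analysis_cols <- c("imdb_score", "imdb_votes")
kept <- toy[complete.cases(toy[, analysis_cols]), ]

nrow(all_cols) # rows complete in every column
nrow(kept)     # rows complete in the two analysis columns
```

Whether the stricter or the gentler filter is appropriate depends on which columns the later steps actually use.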
### Do any missing (NA) values exist in the dataset?
which(is.na(clean_dataset))
sum(is.na(clean_dataset))
## [1] 0
anyNA(clean_dataset) ### The cleaned dataset has no NA values left
## [1] FALSE
anyNA(dataset) ### The original dataset does contain NA values
## [1] TRUE
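`anyNA()` only answers yes or no. To see which columns carry the missing values, a per-column count can help. The sketch below uses an invented toy data frame. It also counts empty strings, since the `str()` output shows `""` entries in `age_certification`, and `read.csv()` by default does not treat an empty string in a character column as `NA`:

```r
# Toy data; values are invented for illustration.
toy <- data.frame(
  age_certification = c("TV-PG", "", "TV-G", ""),
  seasons           = c(2, NA, NA, 5),
  imdb_score        = c(7.1, 6.4, NA, 8.0)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Empty strings are not NA, so count them separately.
empty_cert <- sum(toy$age_certification == "")
empty_cert
```

On the real data the same idea would be `colSums(is.na(dataset))`. Passing `na.strings = c("", "NA")` to `read.csv()` is one way to have empty strings read in as `NA` from the start.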
summary(clean_dataset)
## id title type description
## Length:897 Length:897 Length:897 Length:897
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## release_year age_certification runtime genres
## Min. :1934 Length:897 Min. : 1.00 Length:897
## 1st Qu.:2008 Class :character 1st Qu.: 24.00 Class :character
## Median :2015 Mode :character Median : 39.00 Mode :character
## Mean :2011 Mean : 37.68
## 3rd Qu.:2018 3rd Qu.: 49.00
## Max. :2022 Max. :153.00
## production_countries seasons imdb_id imdb_score
## Length:897 Min. : 1.000 Length:897 Min. :2.20
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:6.60
## Mode :character Median : 2.000 Mode :character Median :7.30
## Mean : 3.343 Mean :7.15
## 3rd Qu.: 4.000 3rd Qu.:7.90
## Max. :49.000 Max. :9.50
## imdb_votes tmdb_popularity tmdb_score
## Min. : 5 Min. : 0.0002 Min. : 0.800
## 1st Qu.: 256 1st Qu.: 3.0290 1st Qu.: 6.700
## Median : 989 Median : 7.2600 Median : 7.500
## Mean : 13186 Mean : 18.4140 Mean : 7.301
## 3rd Qu.: 4807 3rd Qu.: 17.7250 3rd Qu.: 8.000
## Max. :711566 Max. :951.8630 Max. :10.000
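The summary shows a mean of 13186 votes against a median of 989, which points to a strongly right-skewed distribution: a few titles with very many votes pull the mean up. A log transform is a common way to make such a variable easier to inspect. As a small sketch, the vector below reuses only the five quantile values printed in the summary (minimum, quartiles, maximum) as stand-in data, so its mean is not the dataset mean:

```r
# Min, 1st quartile, median, 3rd quartile and max of imdb_votes,
# copied from the summary() output; used here as illustrative data only.
votes <- c(5, 256, 989, 4807, 711566)

# On the raw scale the mean sits far above the median.
mean(votes)
median(votes)

# On the log10 scale the two are much closer together.
log_votes <- log10(votes)
mean(log_votes)
median(log_votes)
```

On the full data, `log10(clean_dataset$imdb_votes)` would be the equivalent transform. It is defined for every row because the minimum vote count is 5.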
by(clean_dataset,clean_dataset$release_year,summary)
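`by()` with `summary` prints a full summary of every column for every release year, which can be long. When only one statistic per year is needed, `aggregate()` gives a compact table. A minimal sketch on invented data:

```r
# Toy data; years and scores are invented for illustration.
toy <- data.frame(
  release_year = c(2015, 2015, 2018, 2018, 2018),
  imdb_score   = c(7.0, 8.0, 6.5, 7.5, 8.5)
)

# Mean IMDb score per release year, one row per year.
by_year <- aggregate(imdb_score ~ release_year, data = toy, FUN = mean)
by_year
```

The same call with `data = clean_dataset` would give the mean score per year for the real data. Swapping `FUN = median` is a reasonable choice for skewed columns such as `imdb_votes`.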
stat.desc(clean_dataset$imdb_score)
stat.desc(clean_dataset$imdb_votes)
### Coefficient of variation (standard deviation divided by the mean):
sd(clean_dataset$imdb_score) / mean(clean_dataset$imdb_score)
sd(clean_dataset$imdb_votes) / mean(clean_dataset$imdb_votes)
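Because the same ratio is computed twice, it can be wrapped in a small helper. The function name `coef_var` is a hypothetical choice for this sketch. The sample vectors reuse the first values printed by `str()` above, so the results describe only those few values, not the whole dataset:

```r
# Coefficient of variation: standard deviation relative to the mean.
# Meaningful for ratio-scale data with a positive mean.
coef_var <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# First values shown in the str() output, used as a small sample.
scores <- c(8.6, 8.6, 8.5, 8.7, 7.9, 8.1, 8.7, 7.9, 7.5, 8.3)
votes  <- c(1092, 1563, 25944, 8675, 2116)

coef_var(scores) # small: scores vary little relative to their mean
coef_var(votes)  # large: vote counts vary a lot relative to their mean
```

A unit-free measure like this makes it possible to compare the spread of scores (on a 1 to 10 scale) with the spread of vote counts (ranging into the hundreds of thousands).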
tab1 <- table(clean_dataset$imdb_votes)
print(tab1)
sort(table(clean_dataset$imdb_votes),decreasing = TRUE)
### Correlation between IMDb score and IMDb votes (Pearson by default)
cor(clean_dataset$imdb_score,clean_dataset$imdb_votes)
## [1] 0.2314977
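`cor()` computes the Pearson coefficient by default, which measures linear association and is sensitive to extreme values such as the maximum of 711566 votes. A rank-based (Spearman) coefficient is a common complement for skewed data. The sketch below uses the first five score and vote values from the `str()` output purely as stand-in data, so the numbers it prints say nothing about the full dataset:

```r
# First five values from the str() output; illustrative sample only.
score <- c(8.6, 8.6, 8.5, 8.7, 7.9)
votes <- c(1092, 1563, 25944, 8675, 2116)

# Pearson: linear association on the raw scale.
pearson <- cor(score, votes, method = "pearson")

# Spearman: association between ranks, less affected by outliers.
spearman <- cor(score, votes, method = "spearman")

# Spearman depends only on ranks, so a log transform leaves it unchanged.
spearman_log <- cor(score, log10(votes), method = "spearman")

c(pearson = pearson, spearman = spearman, spearman_log = spearman_log)
```

On the real data the equivalent call would be `cor(clean_dataset$imdb_score, clean_dataset$imdb_votes, method = "spearman")`.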
### Creating a contingency table
median(clean_dataset$imdb_votes)
## [1] 989
mean(clean_dataset$imdb_votes)
## [1] 13185.51
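# Split at the median: "average" means below the median vote count,
# "beyond average" means at or above it.
clean_dataset$imdb_votes_size <- ifelse(clean_dataset$imdb_votes < median(clean_dataset$imdb_votes),
                                        "average", "beyond average")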
table(clean_dataset$imdb_votes_size)
## average beyond average
## 448 449
clean_dataset$imdb_score_size <- ifelse(clean_dataset$imdb_score < median(clean_dataset$imdb_score),
                                        "average", "beyond average")
table(clean_dataset$imdb_score_size )
##
## average beyond average
## 431 466
table(clean_dataset$imdb_votes_size,clean_dataset$imdb_score_size )
##
## average beyond average
## average 259 189
## beyond average 172 277
prop.table(table(clean_dataset$imdb_votes_size,clean_dataset$imdb_score_size ))
##
## average beyond average
## average 0.2887402 0.2107023
## beyond average 0.1917503 0.3088071
round(prop.table(table(clean_dataset$imdb_votes_size, clean_dataset$imdb_score_size), 1), 2) ### Proportions by row (each row sums to 1)
##
## average beyond average
## average 0.58 0.42
## beyond average 0.38 0.62
round(prop.table(table(clean_dataset$imdb_votes_size, clean_dataset$imdb_score_size), 2), 2) ### Proportions by column (each column sums to 1)
##
## average beyond average
## average 0.60 0.41
## beyond average 0.40 0.59
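The row proportions suggest that titles with more votes are more often in the higher score group (0.62 against 0.42). One way to check whether such a pattern could plausibly be chance is a chi-squared test of independence. The sketch below rebuilds the 2 x 2 table from the counts printed above. The object names are assumptions made for this example:

```r
# Counts copied from the contingency table above
# (rows: vote group, columns: score group).
votes_by_score <- matrix(
  c(259, 189,
    172, 277),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    votes = c("average", "beyond average"),
    score = c("average", "beyond average")
  )
)

# Pearson chi-squared test; for 2 x 2 tables R applies
# Yates' continuity correction by default.
test <- chisq.test(votes_by_score)

test$statistic # chi-squared value
test$p.value   # p-value
test$expected  # counts expected if the two groupings were independent
```

With these counts the p-value is below 0.001, so the association between vote group and score group is unlikely to be due to chance alone. On the data frame itself the call would be `chisq.test(table(clean_dataset$imdb_votes_size, clean_dataset$imdb_score_size))`. A significant test says nothing about the direction of cause between votes and score.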





str(clean_dataset) ### Structure after adding the two grouping columns
## 'data.frame': 897 obs. of 17 variables:
## $ id : chr "ts20945" "ts55748" "ts20005" "ts42867" ...
## $ title : chr "The Three Stooges" "What's My Line?" "I Love Lucy" "Mister Rogers' Neighborhood" ...
## $ type : chr "SHOW" "SHOW" "SHOW" "SHOW" ...
## $ description : chr "The Three Stooges were an American vaudeville and comedy team active from 1922 until 1970, best known for their"| __truncated__ "Four panelists must determine guests' occupations - and, in the case of famous guests, while blindfolded, their"| __truncated__ "Cuban Bandleader Ricky Ricardo would be happy if his wife Lucy would just be a housewife. Instead she tries con"| __truncated__ "Mister Rogers' Neighborhood is an American children's television series that was created and hosted by namesake"| __truncated__ ...
## $ release_year : int 1934 1950 1951 1968 1971 1966 1974 1972 1975 1968 ...
## $ age_certification : chr "TV-PG" "" "TV-G" "TV-Y" ...
## $ runtime : int 19 30 30 29 23 24 57 30 25 54 ...
## $ genres : chr "['comedy', 'family', 'animation', 'action', 'fantasy', 'horror']" "['reality', 'family']" "['comedy', 'family']" "['fantasy', 'music', 'family']" ...
## $ production_countries: chr "['US']" "['US']" "['US']" "['US']" ...
## $ seasons : num 26 18 9 31 6 12 49 6 11 22 ...
## $ imdb_id : chr "tt0850645" "tt1036980" "tt0043208" "tt0062588" ...
## $ imdb_score : num 8.6 8.6 8.5 8.7 7.9 8.1 8.7 7.9 7.5 8.3 ...
## $ imdb_votes : num 1092 1563 25944 8675 2116 ...
## $ tmdb_popularity : num 15.42 87.39 17.09 8.75 45.83 ...
## $ tmdb_score : num 7.6 6.9 8.1 4.7 8 6.7 7.1 7.5 7.3 8.5 ...
## $ imdb_votes_size : chr "beyond average" "beyond average" "beyond average" "beyond average" ...
## $ imdb_score_size : chr "beyond average" "beyond average" "beyond average" "beyond average" ...
## - attr(*, "na.action")= 'omit' Named int [1:8974] 2 3 4 5 6 7 8 9 10 11 ...
## ..- attr(*, "names")= chr [1:8974] "2" "3" "4" "5" ...