Descriptive Statistics of the IMDb Dataset

Descriptive statistics using R programming with detailed information

Problem Statement and Data Set:

This analysis examines IMDb scores alongside IMDb vote counts in a Kaggle dataset of Amazon Prime titles. An IMDb score is the overall user rating of a movie or show, on a scale from 1 to 10, aggregated from individual user votes. These votes capture audience reactions, preferences, and perceptions of a film’s quality.

The objective of this analysis is to examine the relationship between IMDb votes and the assigned IMDb scores, identifying key patterns, trends, and influencing factors that contribute to a movie’s final rating. This includes:

Evaluating the distribution of votes across different movies and genres.

Assessing how the number of votes impacts the IMDb score—whether higher vote counts lead to more stable ratings.

Identifying anomalies, such as movies with exceptionally high or low scores relative to their vote distribution.

Understanding possible biases in user ratings based on factors like genre, popularity, or external reviews.

Exploring statistical techniques, including correlation analysis and predictive modeling, to estimate IMDb scores from vote-related metrics.

This study aims to show how IMDb ratings are shaped by user votes. The findings may also help predict ratings from voting patterns, which could improve future recommendations and audience engagement strategies.

Step 1: Importing required libraries and loading the data

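library(pastecs)  # provides stat.desc(), used further down
dataset <- read.csv('/Users/maheshg/Dropbox/Sample Datasets Kaggle/Amazon Prime TV Shows/titles.csv')
# Drop every row that contains at least one NA
clean_dataset <- na.omit(dataset)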
# print(clean_dataset)
colnames(clean_dataset)
##  [1] "id"                   "title"                "type"                
##  [4] "description"          "release_year"         "age_certification"   
##  [7] "runtime"              "genres"               "production_countries"
## [10] "seasons"              "imdb_id"              "imdb_score"          
## [13] "imdb_votes"           "tmdb_popularity"      "tmdb_score"
str(clean_dataset)
## 'data.frame':    897 obs. of  15 variables:
##  $ id                  : chr  "ts20945" "ts55748" "ts20005" "ts42867" ...
##  $ title               : chr  "The Three Stooges" "What's My Line?" "I Love Lucy" "Mister Rogers' Neighborhood" ...
##  $ type                : chr  "SHOW" "SHOW" "SHOW" "SHOW" ...
##  $ description         : chr  "The Three Stooges were an American vaudeville and comedy team active from 1922 until 1970, best known for their"| __truncated__ "Four panelists must determine guests' occupations - and, in the case of famous guests, while blindfolded, their"| __truncated__ "Cuban Bandleader Ricky Ricardo would be happy if his wife Lucy would just be a housewife. Instead she tries con"| __truncated__ "Mister Rogers' Neighborhood is an American children's television series that was created and hosted by namesake"| __truncated__ ...
##  $ release_year        : int  1934 1950 1951 1968 1971 1966 1974 1972 1975 1968 ...
##  $ age_certification   : chr  "TV-PG" "" "TV-G" "TV-Y" ...
##  $ runtime             : int  19 30 30 29 23 24 57 30 25 54 ...
##  $ genres              : chr  "['comedy', 'family', 'animation', 'action', 'fantasy', 'horror']" "['reality', 'family']" "['comedy', 'family']" "['fantasy', 'music', 'family']" ...
##  $ production_countries: chr  "['US']" "['US']" "['US']" "['US']" ...
##  $ seasons             : num  26 18 9 31 6 12 49 6 11 22 ...
##  $ imdb_id             : chr  "tt0850645" "tt1036980" "tt0043208" "tt0062588" ...
##  $ imdb_score          : num  8.6 8.6 8.5 8.7 7.9 8.1 8.7 7.9 7.5 8.3 ...
##  $ imdb_votes          : num  1092 1563 25944 8675 2116 ...
##  $ tmdb_popularity     : num  15.42 87.39 17.09 8.75 45.83 ...
##  $ tmdb_score          : num  7.6 6.9 8.1 4.7 8 6.7 7.1 7.5 7.3 8.5 ...
##  - attr(*, "na.action")= 'omit' Named int [1:8974] 2 3 4 5 6 7 8 9 10 11 ...
##   ..- attr(*, "names")= chr [1:8974] "2" "3" "4" "5" ...
### Check whether any NA values remain in the dataset
which(is.na(clean_dataset))
sum(is.na(clean_dataset))
## [1] 0
anyNA(clean_dataset) ### The cleaned dataset has no NA values left
## [1] FALSE
anyNA(dataset) ### The original dataset does contain NA values
## [1] TRUE
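The `na.action` attribute in the `str()` output lists 8974 dropped row indices, so `na.omit()` kept only 897 rows. Before accepting a loss of that size, it can help to see which columns hold the missing values. The sketch below uses a small made-up data frame (`toy` and all its values are illustrative assumptions, not rows from titles.csv) to show the idea. Whether a single column such as `seasons` drives the loss in the real data is not confirmed by the output shown here.

```r
# Toy stand-in for the titles data (hypothetical values, not from the CSV).
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  type       = c("SHOW", "MOVIE", "MOVIE", "SHOW"),
  seasons    = c(3, NA, NA, 1),
  imdb_score = c(7.1, 6.4, NA, 8.0)
)

# Count missing values per column to see which columns cause rows to be dropped.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows na.omit() would keep: rows with no NA in any column.
rows_kept <- sum(complete.cases(toy))
print(rows_kept)

# Alternative: require completeness only in the column under study.
toy_subset <- toy[!is.na(toy$imdb_score), ]
print(nrow(toy_subset))
```

Filtering only on the columns that the analysis uses keeps more rows than a blanket `na.omit()`, at the cost of leaving NA values in the other columns.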
summary(clean_dataset)
##       id               title               type           description       
##  Length:897         Length:897         Length:897         Length:897        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1934   Length:897         Min.   :  1.00   Length:897        
##  1st Qu.:2008   Class :character   1st Qu.: 24.00   Class :character  
##  Median :2015   Mode  :character   Median : 39.00   Mode  :character  
##  Mean   :2011                      Mean   : 37.68                     
##  3rd Qu.:2018                      3rd Qu.: 49.00                     
##  Max.   :2022                      Max.   :153.00                     
##  production_countries    seasons         imdb_id            imdb_score  
##  Length:897           Min.   : 1.000   Length:897         Min.   :2.20  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:6.60  
##  Mode  :character     Median : 2.000   Mode  :character   Median :7.30  
##                       Mean   : 3.343                      Mean   :7.15  
##                       3rd Qu.: 4.000                      3rd Qu.:7.90  
##                       Max.   :49.000                      Max.   :9.50  
##    imdb_votes     tmdb_popularity      tmdb_score    
##  Min.   :     5   Min.   :  0.0002   Min.   : 0.800  
##  1st Qu.:   256   1st Qu.:  3.0290   1st Qu.: 6.700  
##  Median :   989   Median :  7.2600   Median : 7.500  
##  Mean   : 13186   Mean   : 18.4140   Mean   : 7.301  
##  3rd Qu.:  4807   3rd Qu.: 17.7250   3rd Qu.: 8.000  
##  Max.   :711566   Max.   :951.8630   Max.   :10.000
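In the summary above, the mean of `imdb_votes` (13186) is roughly 13 times its median (989), which points to a strongly right-skewed distribution. A common way to look at such a variable is on a log scale. The following is a minimal sketch on simulated data: the `votes` vector is generated from a log-normal distribution purely for illustration and is not the real column.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical right-skewed vote counts (log-normal), not the real imdb_votes.
votes <- round(rlnorm(500, meanlog = 7, sdlog = 2)) + 1

# A mean far above the median is a quick indicator of right skew.
mean_to_median <- mean(votes) / median(votes)

# On the log10 scale the mean and median are much closer together.
log_votes <- log10(votes)
log_mean_to_median <- mean(log_votes) / median(log_votes)
print(c(raw = mean_to_median, log10 = log_mean_to_median))

# Histogram bin counts on the log scale, computed without drawing the plot.
log_hist <- hist(log_votes, breaks = 20, plot = FALSE)
print(log_hist$counts)
```

Dropping `plot = FALSE` draws the histogram itself.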
by(clean_dataset,clean_dataset$release_year,summary)
stat.desc(clean_dataset$imdb_score)
stat.desc(clean_dataset$imdb_votes)
### Coefficient of variation (standard deviation divided by the mean)
sd(clean_dataset$imdb_score) / mean(clean_dataset$imdb_score)
sd(clean_dataset$imdb_votes) / mean(clean_dataset$imdb_votes)
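The two lines above repeat the same calculation, so it can be wrapped in a small helper. The sketch below is one way to do that. `coef_var` is a hypothetical name chosen for this example (it is not a base R function), and the input vectors are toy values.

```r
# Coefficient of variation: sample standard deviation divided by the mean.
# `coef_var` is a hypothetical helper name, not part of base R.
coef_var <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))
  m <- mean(x, na.rm = na.rm)
  if (isTRUE(all.equal(m, 0))) stop("CV is undefined when the mean is zero")
  sd(x, na.rm = na.rm) / m
}

scores <- c(6.6, 7.3, 7.9, 8.6, 5.2)    # toy values on a 1-10 scale
votes  <- c(5, 256, 989, 4807, 711566)  # toy values with one very large count

cv_scores <- coef_var(scores)
cv_votes  <- coef_var(votes)
print(round(c(scores = cv_scores, votes = cv_votes), 3))
```

Because the coefficient of variation is unitless, it allows a comparison of spread between variables on very different scales, such as a 1 to 10 score and a vote count in the thousands.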
tab1 <- table(clean_dataset$imdb_votes)
print(tab1)
sort(table(clean_dataset$imdb_votes),decreasing = TRUE)
### Correlation between IMDb scores and IMDb votes (Pearson by default)
cor(clean_dataset$imdb_score,clean_dataset$imdb_votes)
## [1] 0.2314977
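`cor()` computes the Pearson coefficient by default, which measures linear association and is sensitive to extreme values such as very large vote counts. As a cross-check, one could compare it with the correlation on log-transformed votes and with the rank-based Spearman coefficient. The sketch below runs on simulated data: both `votes` and `scores` are generated for illustration only, and the assumed relationship between them is not estimated from the dataset.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-ins: skewed vote counts, and scores that rise weakly with log10(votes).
votes  <- round(rlnorm(300, meanlog = 7, sdlog = 2)) + 1
scores <- 6 + 0.3 * log10(votes) + rnorm(300, sd = 0.8)

pearson_raw <- cor(scores, votes)                       # linear, outlier-sensitive
pearson_log <- cor(scores, log10(votes))                # linear in log10(votes)
spearman    <- cor(scores, votes, method = "spearman")  # rank-based

print(round(c(pearson_raw = pearson_raw,
              pearson_log = pearson_log,
              spearman    = spearman), 3))
```

Spearman's coefficient depends only on ranks, so it is unchanged by the log transform. That makes it a convenient check when one variable is heavily skewed.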
### Creating a contingency table from median splits
median(clean_dataset$imdb_votes)
## [1] 989
mean(clean_dataset$imdb_votes)
## [1] 13185.51
# Split at the median: "average" = below the median, "beyond average" = at or above it
clean_dataset$imdb_votes_size <- ifelse(clean_dataset$imdb_votes < median(clean_dataset$imdb_votes),
                                        "average", "beyond average")
table(clean_dataset$imdb_votes_size)
##
##        average beyond average 
##            448            449
# Same median split for the score: "average" = below the median, "beyond average" = at or above it
clean_dataset$imdb_score_size <- ifelse(clean_dataset$imdb_score < median(clean_dataset$imdb_score),
                                        "average", "beyond average")
table(clean_dataset$imdb_score_size)
## 
##        average beyond average 
##            431            466
table(clean_dataset$imdb_votes_size,clean_dataset$imdb_score_size )
##                 
##                  average beyond average
##   average            259            189
##   beyond average     172            277
prop.table(table(clean_dataset$imdb_votes_size,clean_dataset$imdb_score_size ))
##                 
##                    average beyond average
##   average        0.2887402      0.2107023
##   beyond average 0.1917503      0.3088071
round(prop.table(table(clean_dataset$imdb_votes_size, clean_dataset$imdb_score_size), 1), 2) ### Proportions by row (each row sums to 1)
##                 
##                  average beyond average
##   average           0.58           0.42
##   beyond average    0.38           0.62
round(prop.table(table(clean_dataset$imdb_votes_size, clean_dataset$imdb_score_size), 2), 2) ### Proportions by column (each column sums to 1)
##                 
##                  average beyond average
##   average           0.60           0.41
##   beyond average    0.40           0.59
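The row proportions suggest that titles with more votes tend to fall in the higher score group (0.62 vs 0.42). One way to check whether that pattern could be due to chance is a chi-squared test of independence on the same 2x2 table. The sketch below re-enters the four counts from the contingency table printed above, so it runs without the CSV. The choice of test is an addition for illustration and is not part of the original analysis.

```r
# Counts copied from the votes-by-score contingency table above
# (rows: vote group, columns: score group).
votes_by_score <- matrix(
  c(259, 189,
    172, 277),
  nrow = 2, byrow = TRUE,
  dimnames = list(votes = c("average", "beyond average"),
                  score = c("average", "beyond average"))
)

# Pearson chi-squared test of independence. For 2x2 tables chisq.test()
# applies the Yates continuity correction by default.
test_result <- chisq.test(votes_by_score)
print(test_result)

# Odds ratio: odds of a "beyond average" score in the high-vote group
# relative to the low-vote group.
odds_ratio <- (277 / 172) / (189 / 259)
print(round(odds_ratio, 2))
```

A small p-value here only indicates that the two median-split groupings are associated. It says nothing about direction of cause, and dichotomizing at the median discards information that the correlation on the original values retains.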
str(clean_dataset)
## 'data.frame':    897 obs. of  17 variables:
##  $ id                  : chr  "ts20945" "ts55748" "ts20005" "ts42867" ...
##  $ title               : chr  "The Three Stooges" "What's My Line?" "I Love Lucy" "Mister Rogers' Neighborhood" ...
##  $ type                : chr  "SHOW" "SHOW" "SHOW" "SHOW" ...
##  $ description         : chr  "The Three Stooges were an American vaudeville and comedy team active from 1922 until 1970, best known for their"| __truncated__ "Four panelists must determine guests' occupations - and, in the case of famous guests, while blindfolded, their"| __truncated__ "Cuban Bandleader Ricky Ricardo would be happy if his wife Lucy would just be a housewife. Instead she tries con"| __truncated__ "Mister Rogers' Neighborhood is an American children's television series that was created and hosted by namesake"| __truncated__ ...
##  $ release_year        : int  1934 1950 1951 1968 1971 1966 1974 1972 1975 1968 ...
##  $ age_certification   : chr  "TV-PG" "" "TV-G" "TV-Y" ...
##  $ runtime             : int  19 30 30 29 23 24 57 30 25 54 ...
##  $ genres              : chr  "['comedy', 'family', 'animation', 'action', 'fantasy', 'horror']" "['reality', 'family']" "['comedy', 'family']" "['fantasy', 'music', 'family']" ...
##  $ production_countries: chr  "['US']" "['US']" "['US']" "['US']" ...
##  $ seasons             : num  26 18 9 31 6 12 49 6 11 22 ...
##  $ imdb_id             : chr  "tt0850645" "tt1036980" "tt0043208" "tt0062588" ...
##  $ imdb_score          : num  8.6 8.6 8.5 8.7 7.9 8.1 8.7 7.9 7.5 8.3 ...
##  $ imdb_votes          : num  1092 1563 25944 8675 2116 ...
##  $ tmdb_popularity     : num  15.42 87.39 17.09 8.75 45.83 ...
##  $ tmdb_score          : num  7.6 6.9 8.1 4.7 8 6.7 7.1 7.5 7.3 8.5 ...
##  $ imdb_votes_size     : chr  "beyond average" "beyond average" "beyond average" "beyond average" ...
##  $ imdb_score_size     : chr  "beyond average" "beyond average" "beyond average" "beyond average" ...
##  - attr(*, "na.action")= 'omit' Named int [1:8974] 2 3 4 5 6 7 8 9 10 11 ...
##   ..- attr(*, "names")= chr [1:8974] "2" "3" "4" "5" ...