Factors Influencing the Success of a Movie

Many of us are enthusiastic movie fans who like to discover great movies by browsing film rating and review websites such as IMDb and Rotten Tomatoes. While looking through lists of top movies, we might wonder: what are the primary factors that influence a movie’s success? Is it budget, box office gross, language, or genre?

So we, three movie lovers who are also data lovers, decided to conduct a statistical analysis of the factors that influence a movie’s success. We built our dataset from IMDb’s Top 500 Greatest Movies of All Time list. Our categorical variables are certificate, genre, country, and language. Our numerical variables are duration, rating, vote, gross, and budget. In total there are 9 different certificates, 21 languages, and 29 countries. Gross and budget information was not available for some movies, and we ignored those movies.

There are some sources of noise in the dataset. For example, some foreign movies report gross and budget in foreign currencies, which we converted to USD. Because exchange rates fluctuate constantly, the converted figures might not accurately reflect the actual gross and budget of those movies. There are also some old movies whose gross and budget were recorded a long time ago and may not be directly comparable with figures for recent movies.
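The currency conversion described above can be sketched roughly as follows. This is only an illustration: the rate table `USD_PER_UNIT` and the helper `to_usd` are hypothetical names, and the rates are placeholder numbers, not the ones used for the dataset.

```python
# Placeholder exchange rates (USD per one unit of the foreign currency).
# These values are assumptions for illustration, not the rates used in the analysis.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10, "JPY": 0.009}


def to_usd(amount, currency, rates=USD_PER_UNIT):
    """Convert an amount in the given currency to USD using a fixed rate table."""
    if currency not in rates:
        raise KeyError(f"no exchange rate for {currency!r}")
    return amount * rates[currency]


# Example: a budget of 2,000,000 EUR converted at the placeholder rate.
budget_usd = to_usd(2_000_000, "EUR")
print(round(budget_usd))
```

Because a single fixed rate is applied, the result depends entirely on which date's rate is chosen, which is exactly the source of noise mentioned above.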

The data is partially about people, since ratings are crowd-sourced opinions. Ratings can reflect how the public views the quality of a movie. However, certain groups of people are more likely to rate than others. For example, people who really like or dislike a particular movie or its cast are more likely to rate it in order to express their strong opinions. Movie lovers are also more likely to rate, but arguably in a more objective way.

We included 500 movies with 9 independent variables and one dependent variable, “Rating.” For the numerical variables, we calculated the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
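As a rough sketch of how such a summary can be computed for one numerical column, the snippet below uses only Python's standard library. The function name `describe` and the sample ratings are made up for illustration; they are not the tool or the values used for the figure.

```python
import statistics


def describe(values):
    """Return count, mean, sample standard deviation, min, max, and quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Made-up ratings for illustration only; not values from the dataset.
sample_ratings = [8.0, 8.1, 8.1, 8.3, 8.5, 8.6, 9.0]
summary = describe(sample_ratings)
for name, value in summary.items():
    print(f"{name}: {value}")
```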

Figure: Descriptive summary of data
Dependent Variable: Rating

Movie rating is our dependent variable, and its distribution is approximately normal.


Figure: Distribution of Rating

Independent Variables

  • Certificate
Out of 500 movies, the most common certificates are Restricted (199), Not-Rated (100), Parental Guidance Suggested (80), Parents Strongly Cautioned (63), and General Audience (32). It seems reasonable that a large share of the movies are Restricted, because highly regarded movies often use violence, sexual content, and dark elements as artistic expression to convey deeper meanings. By contrast, movies that are not restricted are more widely accessible and are often viewed by children, so they are less likely to contain adult plots.
  • Duration
The distribution of Duration shows that most top movies run for around 120 minutes.

Figure: Distribution of Duration
  • Year
Figure: Year Distribution
By plotting the Year distribution, we found a sharp increase after 1975, when the movie industry was flourishing. There was also a drastic decrease around 1950, possibly related to World War II. This graph suggests that the development of the movie industry depends heavily on monetary investment and social stability, and that political and economic turmoil affects movie production.

  • Vote
The distribution of Vote is right skewed: most movies sit at the low end of the vote count with a long tail toward high counts, which shows that people actually do not vote that much even for the top 500 movies.


Figure: Vote Distribution

  • Gross & Budget
Gross and Budget have very similar distributions. Both are right skewed, with most values at the low end and a long tail toward large amounts. For gross, most data is concentrated between 0 and 200 million USD, whereas budget is concentrated between 0 and 25 million USD.
Figure: Gross (left) and Budget (right) Distribution
  • Country & Language
From the graphs below, we can tell that most of the top 500 movies were produced in the USA and are in English. Since English is one of the most widely spoken languages, English-language movies can reach a worldwide audience more easily.


Figure: Country Distribution

Figure: Language Distribution

  • Genre
Although there are 22 genres, over half of the top movies are drama.

Figure: Genre Distribution
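The skew direction of a variable such as Vote, Gross, or Budget can be checked numerically instead of only by eye. Below is a minimal sketch using the moment-based skewness coefficient; the function name `sample_skewness` and the toy vote counts are assumptions for illustration, not part of the original analysis.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed standardized deviations.

    A positive result means a long right tail; a negative result means a long left tail.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Made-up vote counts: many small values and one very large one.
toy_votes = [20_000, 25_000, 30_000, 40_000, 60_000, 90_000, 1_500_000]
skew = sample_skewness(toy_votes)
print("right skewed" if skew > 0 else "left skewed or symmetric")
```

A distribution where most values are small and a few are very large, like the toy data here, yields a positive coefficient.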

Linear Regression: Independent Variables vs. Rating
  • Vote and Rating
Figure: Vote vs. Rating
We attempted to fit a linear relationship between each independent variable and the dependent variable, Rating. Some linear relationships are more obvious than others, such as those for Vote, Duration, and Year. The slope of the best-fitted line between Vote and Rating is 0.000000461627719, meaning that an increase of 1,000,000 votes corresponds to an increase of about 0.4616 in rating. A likely explanation is that more popular movies tend to be rated more positively overall.


  • Gross and Rating
There is no clear linear relationship between gross and rating, so a higher rating does not necessarily go along with a higher gross for a movie.

Figure: Gross vs. Rating
  • Duration and Rating
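There is a linear relationship between duration and rating. The slope of the best-fitted line is 0.00254342, meaning that every additional 100 minutes of duration corresponds to an increase of about 0.25 in rating.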
Figure: Duration vs. Rating

  • Year and Rating
Figure: Year From 2017 vs. Rating
Although there is a linear relationship between year and rating, the slope of the best-fitted line is 0.00009698, which is extremely close to 0. This suggests that year does not affect rating to any meaningful extent, which makes sense because any year could produce highly rated movies.

  • Genre and Rating
We used a boxplot to show the distribution of ratings for each genre so that we can compare genres. The plot shows the quartiles of the data along with the maximum and minimum values. Since we only included the top 500 movies, most movies have ratings around 8, and we can tell that the adventure, crime, and documentary genres are more likely to gain high ratings.
Figure: Genre vs. Rating

  • Budget and Rating
Surprisingly, there is no significant linear relationship between budget and rating.

Figure: Budget vs. Rating
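For readers who want to reproduce this kind of best-fitted line, here is a minimal ordinary least squares sketch. The function name `fit_line` and the toy data are assumptions for illustration, and the original analysis may have used a different tool. The last lines rescale the Vote slope reported above so that it is easier to read.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data for illustration only; not values from the dataset.
toy_votes = [100_000, 500_000, 1_000_000, 1_500_000]
toy_ratings = [8.0, 8.2, 8.5, 8.7]
slope, intercept = fit_line(toy_votes, toy_ratings)
print(f"toy slope per vote: {slope:.3e}")

# Rescaling the Vote slope reported in this post to a per-million-votes effect.
reported_vote_slope = 0.000000461627719
per_million_votes = reported_vote_slope * 1_000_000  # about 0.46 rating points
print(f"rating change per 1,000,000 votes: {per_million_votes:.4f}")
```

A tiny slope is not automatically a weak relationship: it has to be read against the scale of the variable, which is why the Vote slope is restated per million votes.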

Discussion

We chose the Rating of movies as our dependent variable for two main reasons. First, Rating is approximately normally distributed. Second, we have a complete set of Rating data points in our dataset.

In our analysis, we wanted to find out which factors affect the Rating of movies. From the dataset, we found that most of the top 500 movies have an R certificate and were produced in the U.S. The primary language of these movies is English, which suggests that the U.S. might have greater capacity to produce highly rated movies. For Genre, we found that the adventure, crime, and documentary genres are more likely to gain high ratings because their median ratings are above 8.0.
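The genre comparison comes down to grouping ratings by genre and comparing medians. A small sketch of that step follows, assuming one genre label per row (a real movie may carry several genres); the function name `median_rating_by_genre` and the toy rows are hypothetical.

```python
from collections import defaultdict
import statistics


def median_rating_by_genre(rows):
    """Group (genre, rating) pairs by genre and return each genre's median rating."""
    by_genre = defaultdict(list)
    for genre, rating in rows:
        by_genre[genre].append(rating)
    return {genre: statistics.median(ratings) for genre, ratings in by_genre.items()}


# Toy rows for illustration only; not values from the dataset.
toy_rows = [
    ("Drama", 7.9), ("Drama", 8.0), ("Drama", 8.1),
    ("Crime", 8.2), ("Crime", 8.4),
    ("Adventure", 8.1), ("Adventure", 8.3),
]
medians = median_rating_by_genre(toy_rows)

# Genres whose median rating is strictly above 8.0.
above_8 = sorted(genre for genre, median in medians.items() if median > 8.0)
print(above_8)
```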
