Factors Influencing the Success of a Movie
Many of us are enthusiastic movie fans who like to discover great films through rating and review websites such as IMDb and Rotten Tomatoes. While browsing the top-movie lists, we might wonder: what are the primary factors that influence a movie’s success? Is it budget, box office, language, or genre?
So we, three movie lovers who are also data lovers, decided to conduct a statistical analysis of the factors that influence a movie’s success. We built our dataset from IMDb’s Top 500 Greatest Movies of All Time list. Our categorical variables are certificate, genre, country, and language. Our numerical variables are duration, year, rating, vote, gross, and budget. There are 9 different certificates, 21 languages, and 29 countries in total. Gross and budget information was not available for some movies, and we ignored those movies.
There are some sources of noise in the dataset. For example, some foreign movies report gross and budget in foreign currencies, which we converted to USD. Because exchange rates fluctuate, the converted figures might not accurately reflect the true gross and budget of those movies. There are also some old movies whose gross and budget figures were recorded a long time ago, which makes them hard to compare directly with figures for recent movies.
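To illustrate how much the choice of exchange rate can move a converted figure, here is a minimal sketch. The rate table, the `to_usd` helper, and the amounts are all hypothetical placeholders, not the rates or values used for our dataset.

```python
# Hypothetical exchange rates (USD per one unit of the foreign currency).
# These are placeholder numbers, not the rates used for the dataset.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "JPY": 0.009}


def to_usd(amount, currency, rates=RATES_TO_USD):
    """Convert an amount in the given currency to USD with a fixed rate table."""
    if currency not in rates:
        raise KeyError(f"no exchange rate for {currency}")
    return amount * rates[currency]


# The same hypothetical gross converted at two different EUR rates.
gross_eur = 10_000_000
usd_a = to_usd(gross_eur, "EUR")
usd_b = to_usd(gross_eur, "EUR", rates={"EUR": 1.20})
spread = usd_b - usd_a
print(f"{usd_a:,.0f} vs {usd_b:,.0f} USD (difference {spread:,.0f})")
```

Even a modest change in the assumed rate shifts the converted gross noticeably, which is why we treat the converted values as noisy.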
The data
is partially related to people as
ratings are crowd-sourced opinions. Ratings can reflect how the public view the
qualities of movies. However, there are certain groups of people who are more
likely to rate. For example, people who really like or dislike this particular
movie or casting are more likely to rate to express their strong opinions.
People who are movie lovers are also more likely to rate but in a more
objective way.
We included 500 movies with 9 independent variables and one dependent variable “Rating.” We calculated the total count, mean, standard deviation, minimum, maximum, 25th, 50th, and 75th percentile.
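The summary statistics listed above can be sketched with Python's standard library alone. The ratings below are hypothetical, not values from our dataset. The `inclusive` quantile method is an assumption on our part, chosen because it interpolates linearly between data points. pandas users would get the same set of statistics from `DataFrame.describe`.

```python
import statistics

# Hypothetical ratings; the real analysis used the 500 ratings in the dataset.
ratings = [8.0, 8.1, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.9, 9.2]

# Quartiles with linear interpolation between data points.
q1, q2, q3 = statistics.quantiles(ratings, n=4, method="inclusive")

summary = {
    "count": len(ratings),
    "mean": statistics.mean(ratings),
    "std": statistics.stdev(ratings),  # sample standard deviation
    "min": min(ratings),
    "25%": q1,
    "50%": q2,
    "75%": q3,
    "max": max(ratings),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```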
Figure: Distribution of Rating
Independent Variables
- Certificate
Out of 500 movies, the five most common certificates are Restricted (199 movies), Not Rated (100), Parental Guidance Suggested (80), Parents Strongly Cautioned (63), and General Audiences (32). It’s reasonable that a large share of the movies are Restricted, because acclaimed movies often use violence, sexuality, and dark themes as artistic expression to convey deeper meaning. In contrast, movies that are not Restricted are more widely accessible and are often viewed by children, so they tend not to contain adult plots.
- Duration
The distribution of Duration shows that the running time of most top movies is concentrated around 120 minutes.
- Year
Figure: Year Distribution
- Vote
The distribution of Vote is right-skewed: most movies sit at the low end of the vote range, with a long tail of heavily voted movies. This shows that people do not actually vote that much, even for the top 500 movies.
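- Gross & Budget
Gross and Budget have very similar distributions. Both are right-skewed, with most values at the low end and a long tail toward large values. For gross, most of the data is concentrated between 0 and 200 million USD, whereas budget is concentrated between 0 and 25 million USD.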
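One way to check the direction of skew numerically, rather than by eye, is the sample skewness coefficient. A positive value means a long tail toward large values, and a negative value means a long tail toward small values. This is a minimal sketch with hypothetical vote counts; the `skewness` helper is an illustrative name, not part of our original analysis.

```python
def skewness(values):
    """Fisher-Pearson coefficient of skewness (population form, g1)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical vote counts (in thousands): many small values, a few very large.
votes = [20, 25, 30, 35, 40, 50, 60, 90, 400, 1500]

# Positive result: the long tail points toward large values.
print(round(skewness(votes), 2))
```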
- Country & Language
From the graph below, we can tell that most of the top 500 movies were produced in the USA in English. Because English is one of the most widely spoken languages, English-language movies spread more easily worldwide.
- Genre
Although there are 22 genres, over half of the top movies are dramas.
Linear Regression: Independent Variables vs. Rating
- Vote and Rating
Figure: Vote vs. Rating
- Gross and Rating
There is no linear relationship between gross and rating, so a higher rating does not necessarily go with a higher box-office gross for a movie.
- Duration and Rating
There is a positive linear relationship between duration and rating. The slope of the best-fit line is 0.00254342, which corresponds to roughly 0.15 rating points per additional hour of running time.
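The slope of a best-fit line can be computed by ordinary least squares. Below is a minimal sketch using hypothetical duration and rating pairs rather than our dataset. `fit_line` is an illustrative helper name, and we are not claiming this is exactly how the slope above was produced.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (duration in minutes, rating) pairs, not taken from the dataset.
durations = [90, 105, 120, 135, 150, 180]
ratings = [8.0, 8.1, 8.1, 8.2, 8.2, 8.4]

slope, intercept = fit_line(durations, ratings)
# Predicted change in rating for 30 extra minutes of running time.
print(f"slope={slope:.5f}, +30 min -> {30 * slope:+.3f} rating points")
```

A slope expressed per minute looks tiny, so multiplying it by a meaningful span of minutes makes the effect easier to read.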
- Year and Rating
Figure: Year From 2017 vs. Rating
- Genre and Rating
We used a boxplot to show the distribution of ratings for each genre so that we can compare genres. The plot shows the quartiles of the data along with the maximum and minimum values. Since we only included the top 500 movies, most movies have ratings around 8, and we can tell that the adventure, crime, and documentary genres are more likely to gain high ratings.
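The center line of each box in a boxplot is the median, so the genre comparison can also be expressed as a table of per-genre medians. Here is a minimal sketch with hypothetical (genre, rating) pairs. The values are made up for illustration and do not reproduce our results.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (genre, rating) pairs, not taken from the dataset.
rows = [
    ("Drama", 8.1), ("Drama", 8.0), ("Drama", 8.3),
    ("Crime", 8.6), ("Crime", 8.4),
    ("Adventure", 8.4), ("Adventure", 8.1), ("Adventure", 8.7),
]

# Group ratings by genre.
by_genre = defaultdict(list)
for genre, rating in rows:
    by_genre[genre].append(rating)

# Median rating per genre, then genres ordered from highest to lowest median.
medians = {genre: median(values) for genre, values in by_genre.items()}
ranked = sorted(medians, key=medians.get, reverse=True)

for genre in ranked:
    print(f"{genre:<10} median rating {medians[genre]:.2f}")
```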
- Budget and Rating
Surprisingly, there is no significant linear relationship between budget and rating.
Discussion
We chose the Rating of movies as our dependent variable for two main reasons. First, Rating is approximately normally distributed. Second, we have a complete set of Rating data points in our dataset.
In our analysis, we wanted to find out which factors affect the Rating of movies. From the dataset, we found that the largest group of the top 500 movies (199) carries an R certificate and that most were produced in the U.S. The primary language of the movies is English, which suggests that the U.S. might have greater strength in producing highly rated movies. For genre, we found that adventure, crime, and documentary movies are more likely to gain high ratings because their median ratings are above 8.0.