Predicting Movies’ and TV Shows’ Popularity with Machine Learning
Shooting an iconic TV series is not easy, especially now that the popularity of TV shows has peaked all over the world: the competition has never been fiercer, and huge budgets are at stake. However, with the help of science, hitting the jackpot is quite realistic.
Alongside our main product, our team works on several projects, solving unusual yet vital tasks with the help of machine learning and artificial intelligence.
For example, a large media channel that produces TV and video content and uses our main product once asked whether we could predict a TV series’ popularity at the stage of its creation. Just as when forecasting the sale and distribution of advertising inventory, we had to work with big data and account for many events and adjustments, and the question turned into an exciting quest.
In this article, I will try to share the technical recipe for a commercially successful series. First, you take a temporal assessment of each genre’s popularity. Then you define the optimal set of settings, mix it with weighted coefficients of actors and scenarios across all genres and all periods, spice it with socio-demographic coefficients of how the artists are perceived, and find the optimal combination. Adding a pinch of magic on top and serving the dish together with bloggers is another step. Oh, and when launching your series, do not forget to consider seasonality and external events.
That will be it.
Now, how did we select the “ingredients” for this ultimate recipe? In other words, how did the sets of attributes appear for testing?
Any experiment requires data, preferably open data. The best option for us was the IMDb ratings dataset, on the basis of which we identified non-obvious correlations and trained an algorithm to predict the popularity of a new series (machine learning with the Random Forest algorithm).
For our study, we did not take the preprocessed IMDb sample from Kaggle; instead, we used all the original data from the IMDb website. All the data was compiled and preprocessed into one large dataset of 418,334 films (or 34,427,319 lines of normalized data), versus 5,000 films in the Kaggle sample.
The next stage was training various Random Forest configurations using H2O in R. The optimal configuration turned out to be a combination of 20 decision trees, each up to ten levels deep.
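To make the "20 trees, up to ten levels deep" configuration more concrete, here is a minimal sketch of a Random Forest regressor written in plain Python. It is not the H2O/R code used in the study: the three features, the synthetic data, and the choice to try roughly a third of the features at each split are all assumptions made for illustration.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (the split-quality measure)."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, max_depth, rng, depth=0):
    """Grow one regression tree; a leaf is simply the mean target value."""
    if depth >= max_depth or len(set(y)) == 1:
        return avg(y)
    features = list(range(len(X[0])))
    rng.shuffle(features)
    k = max(1, len(features) // 3)  # assumption: try about a third of the features
    best, tried = None, 0           # best = (error, feature index, threshold)
    for f in features:
        # Drop the largest value so the right branch is never empty.
        thresholds = sorted({row[f] for row in X})[:-1]
        if not thresholds:
            continue                # feature is constant in this node
        tried += 1
        for t in thresholds:
            left = [v for row, v in zip(X, y) if row[f] <= t]
            right = [v for row, v in zip(X, y) if row[f] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, t)
        if tried == k:
            break
    if best is None:                # every feature is constant: make a leaf
        return avg(y)
    _, f, t = best
    go_left = [row[f] <= t for row in X]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, g in zip(X, go_left) if g],
                           [v for v, g in zip(y, go_left) if g],
                           max_depth, rng, depth + 1),
        "right": build_tree([r for r, g in zip(X, go_left) if not g],
                            [v for v, g in zip(y, go_left) if not g],
                            max_depth, rng, depth + 1),
    }

def predict_tree(node, row):
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def train_forest(X, y, n_trees=20, max_depth=10, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 max_depth, rng))
    return forest

def predict_forest(forest, row):
    """The forest's prediction is the average of its trees' predictions."""
    return avg([predict_tree(tree, row) for tree in forest])

# Synthetic toy data (hypothetical): [is_series, release_year, cast_score] -> rating.
data_rng = random.Random(42)
X, y = [], []
for _ in range(120):
    is_series = data_rng.randint(0, 1)
    year = data_rng.randint(1980, 2018)
    cast_score = data_rng.uniform(4.0, 9.0)
    X.append([is_series, year, cast_score])
    y.append(0.8 * cast_score + 0.5 * is_series + data_rng.gauss(0, 0.2))

forest = train_forest(X, y)
strong_cast = predict_forest(forest, [1, 2010, 8.5])
weak_cast = predict_forest(forest, [1, 2010, 4.5])
print(round(strong_cast, 2), round(weak_cast, 2))
```

In practice a library implementation would be used on a dataset of this size; the sketch only shows what the two hyperparameters control: how many bootstrap-trained trees are averaged, and how many successive splits each tree may make.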
For the model, 26 main parameters were defined and analysed, such as:
- genres and settings;
- cast and film directors;
- the correlation between the year of release and the film or series rating;
- the temporal genre popularity;
- ratings of actors, screenwriters, and directors, depending on the genre;
- socio-demographic perception.
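One of the parameters above, a person's rating depending on the genre, can be illustrated with a small aggregation. This is only a sketch under assumptions: the `(person, genre, rating)` row layout, the placeholder names, and the fallback to a person's overall average are hypothetical and not taken from the study.

```python
from collections import defaultdict

def genre_conditional_ratings(credits):
    """Average rating per (person, genre), plus per person across all genres."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for person, genre, rating in credits:
        # The key (person, None) accumulates the person's overall average.
        for key in ((person, genre), (person, None)):
            sums[key] += rating
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def person_feature(ratings, person, genre):
    """Genre-specific average, falling back to the overall average if needed."""
    return ratings.get((person, genre), ratings.get((person, None)))

# Hypothetical credits: (person, genre, rating of the title).
credits = [
    ("Actor A", "comedy", 6.0),
    ("Actor A", "comedy", 7.0),
    ("Actor A", "drama", 5.0),
    ("Actor B", "drama", 8.0),
]
ratings = genre_conditional_ratings(credits)
print(person_feature(ratings, "Actor A", "comedy"))
print(person_feature(ratings, "Actor B", "comedy"))
```

The fallback matters for people who have never been credited in the genre being evaluated: without it, the model would see a missing value for most actor and genre pairs.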
To assess these parameters correctly, we analyzed the IMDb dataset together with additional external information.
Genre is the first parameter one would check. Setting is another important parameter that affects the popularity of modern series and computer games. Examples of settings include fantasy, science fiction, cyberpunk, steampunk, Wild West, zombies, and so on.
The settings could be attributed to subgenres, but we singled them out as separate genres since each setting has its own audience. Take fantasy, for example: its audience is quite small, but distinct. The exception is the “Game of Thrones” series, which glued to their screens even people who are far from being fantasy fans and received very high ratings.
The description of the “Game of Thrones” series includes the following genres and settings:
On IMDb, the series was attributed to the action, adventure, and drama genres, without taking the combined set of settings into account. Thus, some of the dataset information was lost due to data consolidation. We accepted this limitation but added the task of enriching our database with extra data to our list.
The diagrams below show the ratings of films and TV series by genre:
As you can see, the average score of TV series is higher than that of full-length films. Lately, film studios have taken the quality of this type of content more seriously. We also assume that this attention is caused by the increase in monetisation of TV series.
Let us take a look at how movie ratings have changed since 1980.
It is noticeable that the average movie rating goes up, but over almost the entire review period, more and more full-length pictures have been receiving low marks. According to the second histogram, the television series industry overcame the crisis of the 2000s and has steadily grown in popularity among viewers. The relationship between rating and financial success is a different matter, though.
For a long time, the industry relied on feature films, and series were not considered creatively worthwhile or financially profitable. But any story has a turning point. In this case, it was “Lost” (8.4 points on IMDb, 2004–10), which became iconic and proved that a series could make money. In the post-“Lost” era, TV shows no longer suffer from the former lack of viewers and funding.
However, not all series are equally successful. Many projects were highly praised for directing, acting, or special effects, but were halted due to poor commercial returns. For example, “Firefly” (2002–03) was not continued despite having 9 points on IMDb. At the same time, there are long-running series whose quality varies from year to year, but whose overall popularity is steady, maintained by audience sympathy and discussion.
Therefore, we considered both the quality of the content (the IMDb rating) and commercial success (measured by how lively the discussion is, that is, by the number of comments).
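As an illustration of how the two signals could be folded into a single target, here is a minimal sketch. The article does not say how the rating and the comment count were actually combined, so the formula, the log scaling, and the equal weights are purely assumptions.

```python
import math

def popularity_score(rating, n_comments, max_comments, w_quality=0.5):
    """Blend quality and discussion volume into one score between 0 and 1.

    Assumptions: ratings are on IMDb's 10-point scale, and comment counts are
    log-scaled so that a few heavily discussed titles do not dominate.
    """
    if max_comments <= 0:
        raise ValueError("max_comments must be positive")
    quality = rating / 10.0
    buzz = math.log1p(n_comments) / math.log1p(max_comments)
    return w_quality * quality + (1 - w_quality) * buzz

# Hypothetical titles: a praised but little-discussed show versus a
# moderately rated show with a lively discussion.
praised = popularity_score(rating=9.0, n_comments=200, max_comments=50_000)
discussed = popularity_score(rating=7.0, n_comments=40_000, max_comments=50_000)
print(round(praised, 3), round(discussed, 3))
```

With equal weights, the lively discussion outweighs the two-point rating gap in this toy case; shifting `w_quality` toward 1 would favour critical acclaim instead.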
During the research, we found that particular genres stand out at different periods. In the 1990s, every family watched comedies. Thrillers and melodramas came to replace them, then horror and detective stories stepped up. Musicals became trendy. After the release of “Avatar,” a bunch of works with 3D-visualization came out. Just as with fashion, the popularity of genres is cyclical. Comedies are an example:
The release of TV series that are now considered cult classics coincided with a change in public preferences. From that we can conclude that catching the newest trends results in high ratings and numerous comments. New genres and the attention of the audience open up more financial prospects. New faces appear, and well-known names come off the bench.
The audience associates most actors with a particular type of character. This affects their genre “menu” and personal ratings. An actor highly appreciated in a comedy will not always be well received in drama or fantasy. However, a charismatic actor evokes clear associations and creates a unique mood in a film project of a given genre. For example, Anthony Hopkins is a recognized legend of drama and thriller.
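The cyclical genre popularity described above can be measured in several ways. The sketch below uses one simple proxy, the share of releases per genre within each decade; both the proxy and the toy rows are assumptions made for illustration, not the measure used in the study.

```python
from collections import defaultdict

def genre_share_by_decade(titles):
    """Return {decade: {genre: share of that decade's releases}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, genre in titles:
        decade = year // 10 * 10
        counts[decade][genre] += 1
    shares = {}
    for decade, by_genre in counts.items():
        total = sum(by_genre.values())
        shares[decade] = {genre: n / total for genre, n in by_genre.items()}
    return shares

# Hypothetical (release_year, genre) rows.
titles = [
    (1994, "comedy"), (1996, "comedy"), (1998, "drama"),
    (2003, "thriller"), (2005, "comedy"), (2007, "thriller"), (2008, "drama"),
]
shares = genre_share_by_decade(titles)
for decade in sorted(shares):
    print(decade, shares[decade])
```

Tracking how a genre's share rises and falls from one period to the next gives the model a "where are we in the cycle" feature for a planned release year.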
Another example is the comedian Adam Sandler. This is how his ratings vary depending on genre and time:
There is an additional financial benefit in attracting newcomers and well-known actors who are temporarily unengaged. Nevertheless, as any trend peaks, project budgets climb too. Inflated royalties for irreplaceable actors can have a detrimental effect on how long a project lasts.
Ratings and discussions reflect the public opinion on the project and its media presence in general. Quantitative indicators can tell us a lot, but the semantic analysis of comments provides much more useful information.
It is more accurate to assess actors’ characteristics by associative ranks, that is, by the genres and settings they are most often credited in. It also seems useful to estimate the number of positive and negative reviews about actors, screenwriters, and directors.
Remember how you chose the last series you watched. You may have been influenced by articles or advertisements, ratings, or advice from a friend or a review blogger. The modern viewer loves blogs, personal experiences, and reviews. A series that is promoted at the right time and in the right way receives bonus points at the start.
The difference between the channels should also be addressed. For example, HBO, FX, and BBC compete for airtime, while Netflix is a subscription-based streaming provider.
Some external events that are hard to anticipate can affect the release. For example, the Writers Guild of America strike in 2007-08 caused a decline in ratings and viewership, as well as changes in the duration and timing of many TV shows.
Nevertheless, there are external factors that always need to be considered:
- Seasonality. Summer is not the time to launch a new series. Summer is the time for blockbusters.
- The competition of major projects and TV shows between channels.
- Competition with big scheduled events. No one will watch a new series during the Super Bowl or the World Cup.
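The factors in this list can be turned into simple model inputs. The following is a minimal sketch under assumptions: "summer" is taken as June to August (Northern Hemisphere), the seven-day clash window is arbitrary, and the two event dates are only illustrative entries in what would be a much longer calendar.

```python
from datetime import date

def release_features(release, major_events, window_days=7):
    """Encode seasonality and event clashes for a planned release date."""
    gaps = [abs((release - event).days) for event in major_events]
    nearest = min(gaps) if gaps else None
    return {
        "is_summer": release.month in (6, 7, 8),  # assumption: Northern Hemisphere
        "days_to_nearest_event": nearest,
        "clashes_with_event": nearest is not None and nearest <= window_days,
    }

# Illustrative calendar: Super Bowl LII and the 2018 World Cup final.
events = [date(2018, 2, 4), date(2018, 7, 15)]
features = release_features(date(2018, 2, 1), events)
print(features)
```

Competition between channels could be encoded the same way, by adding the premiere dates of rival shows to the calendar.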
Large channels already employ these simple marketing techniques. But, as in any other field, they are increasingly turning to science and machine learning as well. Netflix, for instance, extended its use of Big Data to relaunch the “House of Cards” series with Kevin Spacey, estimating the show’s popularity before it was made.
It is not known for sure whether they created a universal algorithm or it was a one-time study. For now, there is no evidence that a comprehensive program exists which would forecast and select all the components of a film product.
Still, series are an art, not an exact mathematical model. There are hundreds and thousands of interconnections, the notorious “chemistry.” One does not just estimate roughly the most popular genre, the most popular actor, the luckiest director, and get a masterpiece.
That being said, machine learning algorithms are able to evaluate all possible combinations and estimate the probability of success for each of them, as well as the effect of each minor success or failure on the overall result.
Guessing which setting will break out is probably the key to becoming an iconic series. The success of “Game of Thrones” inspired dozens of medieval, alternative history, and dark fantasy shows, in the same way that “The Sopranos,” “Friends,” or “Buffy the Vampire Slayer” contributed to related genres before.
There is no clear-cut answer as to what is going to be the next hit. But some contextual factors, such as the popularity of books, motion pictures, video games, and other media, let us speculate about, or at least indirectly evaluate, general public demand.
We’ve managed to create an algorithm which can predict a new film project’s popularity. A back-test was conducted on more than 80,000 projects. The forecast error distribution is shown in the graph below:
Take “The Big Bang Theory.” Our algorithm predicts its relevance and commercial success based on historical data. We can safely say that, had the creators used our algorithm back in 2007, they would have been more confident of the worthy results that were achieved later.
“The Big Bang Theory” is a sound example of the genre’s cyclical popularity. Seasoned sitcom screenwriters Chuck Lorre and Bill Prady caught another wave of interest in comedies just in time and unearthed young talents.
The series surpassed all expectations, and filming lasted until recently. But by 2018, the budgets and the cast fees had increased drastically, which contributed to the show’s cancellation. This fits our conclusions on the rise in popularity and financial gains at the initial stage and the high financial costs when you are on a roll.
According to statistics, 22% of all film budgets account for 78% of all popular content. Commercial success comes down less to money than to the ability to hit the target at the right time with the right combination of actors.
Even our current draft model predicts ratings to within a range of no more than 10%, with a probability of more than 85%. There is still a lot of room to improve the algorithm, and the data should be enriched. But using a systematic approach to predict the success of projects can already be regarded as realistic, even in the filmmaking industry.
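One way to read the accuracy figure above is as the share of back-tested titles whose predicted rating lands within a fixed tolerance of the real one. The sketch below computes that share; treating "10%" as one point on the 10-point IMDb scale is an assumption, and the ratings are made-up values, not results from the back-test.

```python
def share_within_tolerance(actual, predicted, tolerance=1.0):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings for four titles (hypothetical, for illustration only).
actual = [8.0, 6.5, 7.2, 5.0]
predicted = [7.6, 6.9, 8.5, 5.3]
hit_rate = share_within_tolerance(actual, predicted)
print(hit_rate, round(mean_absolute_error(actual, predicted), 2))
```

Reporting the hit rate alongside the mean absolute error helps distinguish a model that is slightly off everywhere from one that is usually exact but occasionally far off.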
Want to film something legendary? Then it’s time to do some math!