
MOVIE RECOMMENDATION SYSTEM

BY AMRITA DUTTA


MOTIVATION

In an era where everything is being automated, why should we waste time deciding what movie to watch next? With millions of movies available to stream, choosing just one feels like a chore. Research has shown that, on average, a person spends nearly 309.4 hours per year searching for movies to watch! Who has that much time?
Fret not! We have a system to do the job for you.


ABOUT

My team and I did this project as part of our curriculum at Columbia University in the City of New York. The project illustrates a movie recommendation engine built on the TMDb dataset. We used three different models to perform content-based filtering.
Content-based filtering uses the characteristics of items to find similarities between them. For instance, if a user watches a lot of sci-fi movies, they will be recommended another sci-fi movie because the movies are "similar". Here, I have outlined the process of data extraction, the exploratory data analysis and the methods used to build the engine. The outcome is here for you to judge. I would love to hear your thoughts and comments, so contact me by email!


DATA

The data was extracted from The Movie Database (TMDb) using its APIs. After cleaning, the dataset comprises 25,870 movies and 29 features such as title, rating, cast, crew and genres.
After processing the data and performing some exploratory data analysis, here are some interesting insights:

Here is the distribution of revenue. It shows that the highest-grossing movie belongs to the genres science fiction, action and adventure and, as it turns out, the movie in the graph is Avatar!

Scroll through to see the top-rated movies by genre and the number of movie releases per genre.

Also, scroll for some cool visualizations!


WORD-CLOUD VISUALIZATIONS

Based on Genres

Genre: Horror
Genre: Murder
Genre: Animation
Genre: Romantic

MODELS


SHORTEST PATH

Model 1

We have made use of the concept of networks and the shortest path property. Here, each movie is represented as a node and the edges represent weighted similarity scores. Each edge is weighted by the genre overlap, the director overlap and the country of production shared by the two movies it connects. So, the shorter the path between two movies, the greater their similarity.
When a user inputs a movie title, that movie becomes the source node. The shortest paths from that movie to all others are calculated, and the few movies with the shortest paths are recommended.
In the image below, for the movie "Joker", Movie 5 is the top recommendation since it has the shortest path and is hence the most similar.

Shortest Path method
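
To make the idea concrete, here is a minimal sketch of a shortest-path recommender in plain Python. It is not our project code: the toy catalogue, the scoring weights in `similarity` and the choice of 1 / similarity as the edge length are all assumptions made for illustration. The only thing it takes from the description above is the principle that similar movies sit a short distance apart.

```python
import heapq

# Hypothetical toy catalogue; titles and attributes are made up for illustration.
MOVIES = {
    "Movie A": {"genres": {"crime", "drama"}, "director": "Director X", "country": "US"},
    "Movie B": {"genres": {"crime", "thriller"}, "director": "Director X", "country": "US"},
    "Movie C": {"genres": {"comedy"}, "director": "Director Y", "country": "US"},
    "Movie D": {"genres": {"comedy", "romance"}, "director": "Director Y", "country": "FR"},
}


def similarity(a, b):
    """Assumed score: Jaccard genre overlap, +1 for same director, +0.5 for same country."""
    union = a["genres"] | b["genres"]
    genre = len(a["genres"] & b["genres"]) / len(union) if union else 0.0
    director = 1.0 if a["director"] == b["director"] else 0.0
    country = 0.5 if a["country"] == b["country"] else 0.0
    return genre + director + country


def build_graph(movies):
    """Link every pair with non-zero similarity; edge length is 1 / similarity."""
    graph = {title: {} for title in movies}
    titles = list(movies)
    for i, first in enumerate(titles):
        for second in titles[i + 1:]:
            score = similarity(movies[first], movies[second])
            if score > 0:
                graph[first][second] = graph[second][first] = 1.0 / score
    return graph


def shortest_paths(graph, source):
    """Dijkstra's algorithm: shortest distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbour, length in graph[node].items():
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def recommend(movies, title, k=3):
    """Return the k movies closest to the input title."""
    dist = shortest_paths(build_graph(movies), title)
    ranked = sorted((d, other) for other, d in dist.items() if other != title)
    return [other for _, other in ranked[:k]]


print(recommend(MOVIES, "Movie A"))
```

Note that "Movie D" shares nothing directly with "Movie A", yet it can still be recommended because it is reachable through "Movie C". That is the main thing a path-based model adds over a direct pairwise comparison.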

COSINE SIMILARITY

Model 2

For the cosine similarity based model, movie features like genre, cast and overview are used to score the similarity between movies. For every movie, a string is created by concatenating these features. Then, a similarity score is computed for each movie with reference to the input movie title. The similarity score is calculated as follows: every movie is placed in a multidimensional space where each axis represents a word from the movie data, so each movie becomes a vector of word frequencies. Then, the cosine of the angle between the input movie and every other movie is calculated. The smaller the angle, the higher the similarity.
Finally, the model returns the recommended movies in descending order of their similarity scores.
For instance, the movie "Joker" is more similar to "King of Crime" than to "Blood Brother" based on the angle.
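
The scoring step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the feature strings below are invented, tokenization is a simple whitespace split, and the helper names `vectorize` and `recommend` are hypothetical rather than taken from our code.

```python
import math
from collections import Counter

# Hypothetical feature strings (genre + cast + overview), made up for illustration.
MOVIES = {
    "Movie A": "crime drama actor_one a failed comedian turns to crime in the city",
    "Movie B": "crime thriller actor_two a crime boss rules the city",
    "Movie C": "family animation actor_three a snowman saves christmas",
}


def vectorize(text):
    """Turn a feature string into a word-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(movies, title, k=2):
    """Score every other movie against the input title, highest score first."""
    query = vectorize(movies[title])
    scores = [
        (cosine_similarity(query, vectorize(text)), other)
        for other, text in movies.items()
        if other != title
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:k]


for score, other in recommend(MOVIES, "Movie A"):
    print(f"{other}: {score:.3f}")
```

A score of 1 means the two vectors point in the same direction (an angle of zero), while a score of 0 means the two movies share no words at all.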


TF-IDF

Model 3

The model recommends movies by taking keywords as input and is based on the similarity of movie plots. Term Frequency (TF) is the frequency of a word in a document. It tells us how important each word is within that document. Document frequency is the relative count of documents containing the term, and Inverse Document Frequency (IDF) is its inverse (usually log-scaled), so words that appear in few documents receive a higher weight. The overall importance of each word in the documents in which it appears is equal to TF * IDF.

Since we have used the TF-IDF vectorizer, which normalizes each vector to unit length, calculating the dot product directly gives us the cosine similarity score. Movies are then recommended based on these scores.
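
Here is a small hand-rolled sketch of the idea in plain Python, written to show why the dot product is enough. It is not the vectorizer we used: the plots are invented, and the IDF formula `log(N / df) + 1` is one common variant that is assumed here for illustration.

```python
import math
from collections import Counter

# Hypothetical plot summaries, made up for illustration.
PLOTS = {
    "Movie A": "a family is snowed in on christmas eve",
    "Movie B": "a detective hunts a killer in the city",
    "Movie C": "snow falls as santa prepares for christmas",
}


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """IDF per word, assuming the variant log(N / df) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """TF * IDF weights, scaled to unit length; words unseen in the corpus are dropped."""
    counts = Counter(word for word in tokens if word in idf)
    vec = {word: (count / len(tokens)) * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in vec.values()))
    return {word: weight / norm for word, weight in vec.items()} if norm else {}


def recommend(plots, keywords, k=2):
    """Rank movies by the dot product of unit-length TF-IDF vectors (= cosine similarity)."""
    docs = {title: tokenize(plot) for title, plot in plots.items()}
    idf = fit_idf(list(docs.values()))
    query = tfidf_vector(tokenize(keywords), idf)
    scores = []
    for title, tokens in docs.items():
        doc_vec = tfidf_vector(tokens, idf)
        score = sum(query[word] * doc_vec[word] for word in query.keys() & doc_vec.keys())
        if score > 0:
            scores.append((score, title))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:k]


for score, title in recommend(PLOTS, "snow christmas"):
    print(f"{title}: {score:.3f}")
```

Because both the query vector and each plot vector have length 1, the denominator of the cosine formula is 1 and only the dot product remains. Exact word matching is also visible here: "snowed" in the first plot does not match the keyword "snow".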


RESULTS

Below are the results of the three models. The first two models, the shortest path and cosine similarity models, take a movie title as input. This lets users search for movies based on what they have watched previously.
The third model, the TF-IDF model, recommends movies based on input keywords, which lets users search for movies based on how they are "feeling" at the moment.
We got pretty decent recommendations from all three models!


SHORTEST PATH RESULT

The image below shows the top 5 movie recommendations for the movie "Santa World" using the shortest path model.

[Image: top 5 recommendations for "Santa World" from the shortest path model]

COSINE SIMILARITY RESULT

The image below shows the top movie recommendations for the movie "Santa World" using the cosine similarity model.

[Image: top recommendations for "Santa World" from the cosine similarity model]

TF-IDF RESULT

The image below shows the top movie recommendations for the keywords "snow" and "christmas" using the TF-IDF model. It also shows the similarity scores for each movie.

[Image: top recommendations and similarity scores for the keywords "snow" and "christmas" from the TF-IDF model]

CONCLUSION

The all-new Recommendation Engine saves time and personalizes the experience of selecting the right movie by providing tailored options. Our content-based filtering has an advantage since we do not need other users' profiles to recommend movies. The movie data is all we need!
On a side note, a major challenge when developing a recommendation tool is subjectivity. For a movie recommendation, there is no right answer, as the user's preference is the driving factor. Choosing the right set of features improves the recommendation results. Hence, the three different models take into account different features of movies to form a robust recommendation system.


©2019 by Amrita Dutta
