Do you wonder how Netflix suggests movies that align so closely with your interests? Or maybe you want to build a system that can make such suggestions to its users too?
If your answer was yes, then you’ve come to the right place as this article will teach you how to build a movie recommendation system by using Python.
However, before we start discussing the ‘How’ we must be familiar with the ‘What.’
Recommendation System: What is It?
Recommendation systems have become an integral part of our daily lives. From online retailers like Amazon and Flipkart to platforms like YouTube and Facebook, nearly every major digital company uses recommendation systems to provide a personalized experience to its users.
Some examples of recommendation systems in your everyday life include:
- The suggestions you get from Amazon when you buy products are a result of a recommender system.
- YouTube uses a recommender system to suggest videos suited for your taste.
- Netflix has a famous recommendation system for suggesting shows and movies according to your interests.
A recommender system suggests products to users by using data. This data could be about the user’s stated interests, history, and so on. If you’re studying machine learning and AI, recommender systems are well worth studying as they are becoming increasingly popular and advanced.
Types of Recommendation Systems
There are two types of recommendation systems:
1. Collaborative Recommendation Systems
A collaborative recommendation system suggests items according to how much similar users liked them. It groups users with similar interests and tastes and recommends to each user the items that others in the group liked.
For example, suppose you and one other user liked Sholay. Now, after watching Sholay and liking it, the other user liked Golmaal. Because you and the other user have similar interests, the recommender system would suggest you watch Golmaal based on this data. This is collaborative filtering.
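The Sholay and Golmaal example can be sketched in a few lines of plain Python. This is only a toy illustration, not how a production system works: the user names, the third user, the extra title, and the use of Jaccard similarity are all assumptions made for the sketch.

```python
# Toy "likes" data; user names and the title "Lagaan" are made up for illustration.
likes = {
    "you": {"Sholay"},
    "other_user": {"Sholay", "Golmaal"},
    "third_user": {"Lagaan"},
}

def jaccard(a, b):
    """Overlap between two sets of liked movies (0 = nothing shared, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, likes):
    """Score unseen movies by the similarity of the users who liked them."""
    scores = {}
    for user, movies in likes.items():
        if user == target:
            continue
        sim = jaccard(likes[target], movies)
        for movie in movies - likes[target]:
            scores[movie] = scores.get(movie, 0.0) + sim
    # Keep only movies backed by at least one similar user, best first.
    return sorted((m for m, s in scores.items() if s > 0), key=lambda m: -scores[m])

print(recommend("you", likes))
```

Because only the other user shares a liked movie with you, Golmaal is the only title that gets a positive score.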
2. Content-Based Recommendation Systems
A content-based recommender system suggests items based on the data it receives from a user. It could be based on explicit data (‘Likes’, ‘Shares’, etc.) or implicit data (watch history). The recommendation system would use this data to create a user-specific profile and would suggest items based on that profile.
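As a rough sketch of the profile idea, the snippet below counts the genres in a user's watch history and scores unseen titles against that profile. The catalogue, the titles, and the simple genre-count scoring are hypothetical choices for illustration; real systems use richer features.

```python
from collections import Counter

# Hypothetical catalogue: title -> genres. Titles and genre tags are illustrative only.
catalogue = {
    "Movie A": ["Action", "Thriller"],
    "Movie B": ["Romance", "Drama"],
    "Movie C": ["Action", "Adventure"],
    "Movie D": ["Drama"],
}

def build_profile(watched):
    """Count how often each genre appears in the user's watch history."""
    return Counter(g for title in watched for g in catalogue[title])

def recommend(watched, top_n=2):
    """Rank unseen titles by how strongly their genres match the user's profile."""
    profile = build_profile(watched)
    candidates = [t for t in catalogue if t not in watched]
    scored = [(sum(profile[g] for g in catalogue[t]), t) for t in candidates]
    scored = [(s, t) for s, t in scored if s > 0]  # drop titles with no genre overlap
    ranked = sorted(scored, key=lambda st: (-st[0], st[1]))
    return [t for _, t in ranked][:top_n]

print(recommend(["Movie A"]))
```

Note that no other user's data is involved: the suggestions come only from the profile built from this user's own history.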
Building a Basic Movie Recommendation System
Now that we have covered the basics of recommender systems, let’s get started on building a movie recommendation system.
We can start building a Python-based movie recommendation system by using the full MovieLens dataset. This dataset contains more than 26 million ratings and 750,000 tag applications applied to over 45,000 movies. It also includes tag genome data with more than 12 million relevance scores.
We are using the full dataset for creating a basic movie recommendation system. However, you’re free to use a smaller dataset for this project. First, we’ll have to import all the required libraries:
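The import statements themselves are not shown here, so the block below is inferred from the names the later snippets use (pd, np, and literal_eval). Treat it as a likely minimal set rather than the original list.

```python
from ast import literal_eval  # parses the stringified lists in the 'genres' column

import numpy as np   # np.nan marks a missing release year
import pandas as pd  # loading, filtering, and sorting the movie data
```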
A basic Python-based movie recommendation system suggests movies according to their popularity and genre. It works on the notion that popular, critically acclaimed movies have a high probability of being liked by the general audience. Keep in mind that such a system doesn’t give personalized suggestions.
To implement it, we will sort the movies according to their popularity and rating and pass in a genre argument to get a genre’s top movies:
Input
md = pd.read_csv('../input/movies_metadata.csv')
md.head()
Output
adult | belongs_to_collection | budget | genres | video | id | imdb_id | original_title | overview | revenue | title
0 | False | {'id': 10194, 'name': 'Toy Story Collection'} | 30000000 | [{'id': 16, 'name': 'Animation'}… | False | 862 | tt0114709 | Toy Story | Led by Woody, Andy’s toys live happily… | 373554033 | Toy Story
1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}… | False | 8844 | tt0113497 | Jumanji | When siblings Judy and Peter… | 262797249 | Jumanji
2 | False | {'id': 119050, 'name': 'Grumpy Old Men'} | 0 | [{'id': 10749, 'name': 'Romance'}… | False | 15602 | tt0113228 | Grumpy Old Men | A family wedding reignites the ancient… | 0 | Grumpier Old Men
3 | False | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}… | False | 31357 | tt0114885 | Waiting to Exhale | Cheated on, mistreated and stepped… | 81452156 | Waiting to Exhale
Input
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
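To make the genre-parsing step easier to follow, here is a small standalone sketch of what it does to a single cell. The sample string is made up in the shape of the genres column shown in the output above, and parse_genres is a hypothetical helper name.

```python
from ast import literal_eval

def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of genre names."""
    parsed = literal_eval(raw)
    return [g["name"] for g in parsed] if isinstance(parsed, list) else []

# Illustrative cell value, not copied from the dataset.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))  # ['Animation', 'Comedy']
print(parse_genres("[]"))    # [] (what a missing value becomes after fillna('[]'))
```

literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file.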
The Formula for Our Chart
To create our chart of top movies, we use the TMDB ratings together with IMDb’s weighted rating formula, which is as follows:
Weighted Rating (WR) = (v / (v + m)) × R + (m / (v + m)) × C
Here, v stands for the number of votes a movie got, m is the minimum number of votes a movie should have to get on the chart, R stands for the average rating of the movie, and C is the mean vote across all movies in the dataset.
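A quick worked example may help show how the formula behaves. The sketch below plugs in the values of m and C that this article computes for the full dataset (434 and about 5.2449) and the vote count and rating listed for Inception in the final chart. The second call uses a hypothetical movie with the same rating but only 500 votes.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: blends a movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434                # minimum votes required (value reported in this article)
C = 5.244896612406511  # mean vote across all movies (value reported in this article)

popular = weighted_rating(v=14075, R=8, m=m, C=C)  # Inception's votes and rating
niche = weighted_rating(v=500, R=8, m=m, C=C)      # hypothetical movie, few votes

print(round(popular, 4))
print(round(niche, 4))
```

With many votes the result stays close to the movie's own rating, while a movie with few votes is pulled noticeably towards the mean C. That is the point of the formula: a high average from a handful of voters counts for less.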
Building the Charts
Now that we have the dataset and the formula in place, we can start building the chart. We’ll only add movies whose vote count is at or above the 95th percentile, that is, movies with more votes than at least 95% of the movies in the dataset. We’ll begin with creating a top 250 chart.
Input
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C
Output
5.244896612406511
Input
m = vote_counts.quantile(0.95)
m
Output
434.0
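To see what the 0.95 quantile does, here is a small sketch on made-up vote counts. The numbers 1 to 100 are illustrative and have nothing to do with the dataset.

```python
import pandas as pd

# Illustrative vote counts: 100 movies with 1, 2, ..., 100 votes.
votes = pd.Series(range(1, 101))

# The cutoff below which 95% of the values fall (linear interpolation by default).
m = votes.quantile(0.95)

# Only movies at or above the cutoff qualify for the chart.
qualified = votes[votes >= m]

print(m)
print(len(qualified))
```

Only the top 5% of movies by vote count pass the cutoff, which is why the chart ends up containing well-known titles.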
Input
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)
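One detail of the year extraction is easy to get wrong: a comparison such as x != np.nan is always True, because NaN is never equal to anything, and unparseable dates become NaT under errors='coerce'. The sketch below, on three made-up date strings, uses pd.notna for the missing-value check and also shows the .dt.year accessor as an alternative.

```python
import numpy as np
import pandas as pd

# Illustrative inputs: one valid date, one unparseable string, one missing value.
dates = pd.Series(["1995-10-30", "not a date", None])
parsed = pd.to_datetime(dates, errors="coerce")  # invalid entries become NaT

# Year as a string, with NaN for missing dates.
year_str = parsed.apply(lambda x: str(x).split("-")[0] if pd.notna(x) else np.nan)

# Alternative: numeric year via the .dt accessor (NaN where the date is missing).
year_num = parsed.dt.year

print(year_str.tolist())
print(np.nan != np.nan)  # True, which is why "!= np.nan" cannot detect missing values
```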
Input
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape
Output
(2274, 6)
As you can see, to get a place on our chart a movie must have at least 434 votes, and 2,274 movies meet this threshold. Note that 5.24 is not a cutoff: it is C, the mean rating across all movies on a scale of 10, towards which the formula pulls movies with few votes.
Input
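def weighted_rating(x):
    v = x['vote_count']    # number of votes for this movie
    R = x['vote_average']  # average rating of this movie
    # m (minimum votes) and C (mean vote) are the values computed above
    return (v/(v+m) * R) + (m/(m+v) * C)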
Input
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
Input
qualified = qualified.sort_values('wr', ascending=False).head(250)
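The same filter, score, and sort pipeline can be tried end to end on a tiny table. The three movies, their numbers, and the threshold of 100 votes below are invented for illustration. The sketch also computes the weighted rating on whole columns instead of using apply, which is an optional stylistic variation rather than the approach used above.

```python
import pandas as pd

# Made-up data: titles, vote counts, and ratings are illustrative only.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [1000, 50, 400],
    "vote_average": [7.0, 9.0, 8.0],
})

C = df["vote_average"].mean()  # mean rating across all movies
m = 100                        # illustrative minimum-votes threshold

# Keep movies with enough votes; .copy() so new columns go on an independent frame.
qualified = df[df["vote_count"] >= m].copy()

# Weighted rating computed column-wise.
v, R = qualified["vote_count"], qualified["vote_average"]
qualified["wr"] = v / (v + m) * R + m / (v + m) * C

qualified = qualified.sort_values("wr", ascending=False)
print(qualified[["title", "wr"]])
```

Movie B has the highest raw rating but is excluded because it has too few votes, which mirrors what the threshold m does on the real dataset.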
With all of this in place, let’s build the chart:
Top Movies Overall
Input
qualified.head(15)
Output
title | year | vote_count | vote_average | popularity | genres | wr |
15480 | Inception | 2010 | 14075 | 8 | 29.1081 | [Action, Thriller, Science Fiction, Mystery, A… | 7.917588 |
12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167 | [Drama, Action, Crime, Thriller] | 7.905871 |
22879 | Interstellar | 2014 | 11187 | 8 | 32.2135 | [Adventure, Drama, Science Fiction] | 7.897107 |
2843 | Fight Club | 1999 | 9678 | 8 | 63.8696 | [Drama] | 7.881753 |
4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.0707 | [Adventure, Fantasy, Action] | 7.871787 |
292 | Pulp Fiction | 1994 | 8670 | 8 | 140.95 | [Thriller, Crime] | 7.868660 |
314 | The Shawshank Redemption | 1994 | 8358 | 8 | 51.6454 | [Drama, Crime] | 7.864000 |
7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.3244 | [Adventure, Fantasy, Action] | 7.861927 |
351 | Forrest Gump | 1994 | 8147 | 8 | 48.3072 | [Comedy, Drama, Romance] | 7.860656 |
5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.4235 | [Adventure, Fantasy, Action] | 7.851924 |
256 | Star Wars | 1977 | 6778 | 8 | 42.1497 | [Adventure, Action, Science Fiction] | 7.834205 |
1225 | Back to the Future | 1985 | 6239 | 8 | 25.7785 | [Adventure, Comedy, Science Fiction, Family] | 7.820813 |
834 | The Godfather | 1972 | 6024 | 8 | 41.1093 | [Drama, Crime] | 7.814847 |
1154 | The Empire Strikes Back | 1980 | 5998 | 8 | 19.471 | [Adventure, Action, Science Fiction] | 7.814099 |
46 | Se7en | 1995 | 5915 | 8 | 18.4574 | [Crime, Mystery, Thriller] |
Voila, you have created a basic Python-based movie recommendation system!
We will now narrow down our recommender system’s suggestions by genre so that they can be more precise. After all, not everyone will like The Godfather equally.
Narrowing Down the Genre
So, now we’ll modify our recommender system to be more genre-specific:
Input
s = md.apply(lambda x: pd.Series(x['genres']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
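The step above turns each movie into one row per genre. On recent pandas versions (0.25 and later), DataFrame.explode offers a shorter way to get the same shape. The sketch below shows it on two made-up rows; it is an alternative, not the approach used in this article.

```python
import pandas as pd

# Illustrative rows; the genre lists are shortened for the example.
md = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": [["Animation", "Comedy"], ["Adventure"]],
})

# One row per (movie, genre) pair; the original index is repeated for each genre.
gen_md = md.explode("genres").rename(columns={"genres": "genre"})

print(gen_md)
```

A movie with two genres now appears twice, so filtering on the genre column picks it up under either genre.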
Input
def build_chart(genre, percentile=0.85):
    # Keep only the rows that belong to the requested genre
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # Weighted rating, using the genre-specific m and C
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    return qualified
We have now created a function that builds a chart of the top movies for any genre, using the 85th percentile of vote counts within that genre as the default cutoff. Let’s try it with the romance genre, because it didn’t show up much in our previous chart.
Top Movies in Romance
Input
build_chart('Romance').head(15)
Output
title | year | vote_count | vote_average | popularity | wr |
10309 | Dilwale Dulhania Le Jayenge | 1995 | 661 | 9 | 34.457 | 8.565285 |
351 | Forrest Gump | 1994 | 8147 | 8 | 48.3072 | 7.971357 |
876 | Vertigo | 1958 | 1162 | 8 | 18.2082 | 7.811667 |
40251 | Your Name. | 2016 | 1030 | 8 | 34.461252 | 7.789489 |
883 | Some Like It Hot | 1959 | 835 | 8 | 11.8451 | 7.745154 |
1132 | Cinema Paradiso | 1988 | 834 | 8 | 14.177 | 7.744878 |
19901 | Paperman | 2012 | 734 | 8 | 7.19863 | 7.713951 |
37863 | Sing Street | 2016 | 669 | 8 | 10.672862 | 7.689483 |
882 | The Apartment | 1960 | 498 | 8 | 11.9943 | 7.599317 |
38718 | The Handmaiden | 2016 | 453 | 8 | 16.727405 | 7.566166 |
3189 | City Lights | 1931 | 444 | 8 | 10.8915 | 7.558867 |
24886 | The Way He Looks | 2014 | 262 | 8 | 5.71127 | 7.331363 |
45437 | In a Heartbeat | 2017 | 146 | 8 | 20.82178 | 7.003959 |
1639 | Titanic | 1997 | 7770 | 7 | 26.8891 | 6.981546 |
19731 | Silver Linings Playbook | 2012 | 4840 | 7 | 14.4881 | 6.970581 |
Now, you have a movie recommender system that suggests top movies according to a chosen genre. We recommend testing out this recommender system with other genres too, such as Action, Drama, and Thriller. Share the top three movies the recommender system suggests in your favourite genre in the comment section below.
Learn More About a Movie Recommendation System
As you must have noticed by now, building a Python-based movie recommendation system is quite simple. All you need is a little knowledge of data science and a little effort to create a working recommender system.
However, what if you want to build more advanced recommender systems? What if you want to create a recommender system that a large corporate might consider using?
If you’re interested in learning more about recommender systems and data science, then we recommend taking a data science course. With a course, you’ll learn all the fundamental and advanced concepts of data science and machine learning. Moreover, you’ll study from industry experts who will guide you throughout the course to help you avoid doubts and confusion.
Final Thoughts
You now know how to build a movie recommendation system. After you have created the system, be sure to share it with others and show them your progress. Recommender systems have a diverse range of applications so learning about them will surely give you an upper hand in the industry.
What is collaborative filtering and what are its types?
Collaborative filtering is a type of recommendation system that builds a model from users’ preferences. The history of the users acts as the dataset for collaborative filtering. Collaborative filtering is of two types:
1. User-based collaborative filtering: We take a user, say “A”, find other users with similar preferences, and then recommend to “A” the items those users liked that “A” has not encountered yet.
2. Item-based collaborative filtering: Instead of finding users with similar preferences, we find movies similar to the ones “A” likes and recommend those that “A” has not watched yet.
What are the advantages and disadvantages of content-based filtering?
Content-based filtering collects data from the user and suggests items accordingly. Some of its advantages and disadvantages are mentioned below:
Advantages
1. Unlike collaborative filtering, the model does not need data about other users with similar preferences, as it bases its suggestions on the primary user alone.
2. The model can recommend niche movies that match your preferences even if only a few other users have watched them.
Disadvantages
1. This technique requires a lot of domain knowledge, because the item features are largely hand-engineered. The model can only be as good as those features.
2. Its ability to recommend movies is limited since it only works according to the existing interests of the user.
Which popular applications use collaborative filtering algorithms?
Collaborative filtering is becoming the primary driving algorithm for many popular applications, as more and more businesses focus on delivering rich personalized content. For example, you have probably seen the message “Customers who bought this also bought” on many e-commerce websites. The following are some of the applications with a large user base worldwide:
1. YouTube uses this algorithm along with some other powerful algorithms to provide video recommendations on the home page.
2. E-commerce websites such as Amazon, Flipkart, and Myntra also use this algorithm to provide product recommendations.
3. Video streaming platforms are among the biggest examples; they use user ratings, average ratings, and related content to provide personalized suggestions.