Content Based Filtering

Soumyajit Choudhury
5 min read · Dec 4, 2020

The architecture and intuition of content based filtering

Recommender systems are active information filtering systems that personalize the information reaching a user based on the user's interests, the relevance of the information, and so on. They are widely used to recommend movies, articles, restaurants, items to buy, and more.

There are two main methods of filtering.

1. Content Based Filtering

2. Collaborative Filtering

Content Based Recommender: A content based recommender works with data that the user provides, either explicitly by giving a rating or implicitly by clicking on a link. Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more input or takes action on the recommendations, the engine becomes more and more accurate.

If a user is watching a movie, the system looks for other movies with similar content or of the same genre as the movie the user is watching, and recommends movies based on that.

A basic architecture of content based filtering

Intuition:-

User rating-Movie Table

Assume we have User 1, who watched Movie 1 (Action) and rated it 5/5, Movie 2 (Romance) and rated it 4/5, and Movie 3 (Action) and rated it 5/5.

Now, if User 2 watches Movie 6 (Action) and rates it 5/5, the content based recommendation system will most probably recommend the action movies Movie 1 or Movie 3 to User 2, based on the ratings and the type of movie that both users have in common.

Based on the above table, we can see that Movie 1, Movie 2 and Movie 3 tend to be action movies, while Movie 4 and Movie 5 tend to be romantic movies. We can also conclude that User 1 and User 2 prefer action movies, while User 3 and User 4 prefer romantic movies.

where,

N(U) = number of users = 4

N(M) = number of movies = 5

N(Features) = 2, i.e. Action and Romance

Now let's consider Movie 1. Taking the intercept term as X(0)=1 and adding its two feature values, we get the feature vector of Movie 1 as a (3,1) vector, [1, 0.9, 0]. Similarly, we get feature vectors for Movies 2, 3, 4 and 5.

For each user j we learn a parameter vector Theta^(j), which is a vector of 3 real numbers (the 2 features plus 1 for the intercept). Using (Theta^(j))^(T)*X^(i), we can then predict the rating of Movie i by user j, where j is the index of the user and i is the index of the movie.
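To make the prediction formula concrete, here is a minimal sketch in Python. The feature vector [1, 0.9, 0] is the one from the text above; the parameter vector [0, 5, 0] is a hypothetical value chosen only for illustration (a user who cares about action and is indifferent to romance), not a value learned from the table.

```python
import numpy as np

# Feature vector of a movie: [intercept term, action score, romance score]
x_movie = np.array([1.0, 0.9, 0.0])

# Hypothetical parameter vector of a user (assumed for illustration only)
theta_user = np.array([0.0, 5.0, 0.0])

# Predicted rating = (Theta^(j))^T * X^(i), i.e. a dot product
predicted_rating = theta_user @ x_movie
print(predicted_rating)  # approximately 4.5 stars
```

With these assumed numbers, the whole rating comes from the action feature: 5 * 0.9 = 4.5.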

Now let's understand this better with a worked calculation.

Calculation

So, based on the calculation, the predicted rating of Movie 3 by User 1 would be approximately 4.5 stars.

Now let's go over some terms we need to understand first.

Important Terms

Now let's see how we can estimate the parameter vector of each user.

Learning the parameter vector Theta^(j) can be treated as a linear regression problem: we choose Theta^(j) so that the ratings it predicts are as close as possible to the ratings the user actually gave, and the squared difference between the two is the loss function.

We can represent the equation as:-

Just like in linear regression models, we want to choose the specific parameter vector that minimizes the mean squared error term. We can also add a regularization term to avoid overfitting.

After minimizing this term, we get a pretty good estimate of the parameter vector of user j, which is what we use to predict that user's movie ratings.

We can represent the equation as,

Equation for a single user and multiple users
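For reference, one common way to write this regularized objective is sketched below in LaTeX. The symbols are assumptions introduced here, not taken from the figures: r(i,j)=1 means user j has rated movie i, y^(i,j) is that rating, lambda is the regularization strength, N_U is the number of users N(U), and N_F is the number of features N(Features). The intercept parameter (k=0) is left out of the regularization sum, which is a frequent convention rather than a requirement.

```latex
% One user j: squared error over the movies this user has rated, plus regularization
\[
\min_{\theta^{(j)}} \; \frac{1}{2} \sum_{i \,:\, r(i,j)=1}
  \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
  + \frac{\lambda}{2} \sum_{k=1}^{N_F} \left( \theta^{(j)}_{k} \right)^{2}
\]

% All users: the same objective summed over every user
\[
J\left(\theta^{(1)}, \ldots, \theta^{(N_U)}\right)
  = \frac{1}{2} \sum_{j=1}^{N_U} \sum_{i \,:\, r(i,j)=1}
  \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
  + \frac{\lambda}{2} \sum_{j=1}^{N_U} \sum_{k=1}^{N_F} \left( \theta^{(j)}_{k} \right)^{2}
\]
```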

We can write the whole loss function as J(Theta^(1), Theta^(2), ..., Theta^(n)) for n = N(U) users. In gradient descent, the gradient of this loss is multiplied by the learning rate and subtracted from the current parameter vector, which step by step reduces the difference between the predicted and the actual ratings.

The parameters are updated with gradient descent as follows.

Equation for gradient descent
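The update can be sketched in a few lines of NumPy for a single user. This is a minimal illustration under stated assumptions, not the exact procedure from the figures: the function name fit_user_parameters, the learning rate alpha, the regularization strength lam and the feature values of the second and third movie are all hypothetical. The ratings 5, 4, 5 are the ones User 1 gave in the example above, and the intercept is not regularized.

```python
import numpy as np

def fit_user_parameters(X, y, lam=0.1, alpha=0.1, n_iters=5000):
    """Fit one user's parameter vector by batch gradient descent.

    X     : (n_rated, n_features + 1) movie features, column 0 is the intercept term
    y     : (n_rated,) ratings this user gave to those movies
    lam   : regularization strength (assumed value)
    alpha : learning rate (assumed value)
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = X @ theta - y          # predicted minus actual ratings
        grad = X.T @ error             # gradient of the squared-error term
        grad[1:] += lam * theta[1:]    # regularize every weight except the intercept
        theta -= alpha * grad          # step against the gradient
    return theta

# Hypothetical features [intercept, action, romance] for three rated movies
X = np.array([[1.0, 0.9, 0.0],
              [1.0, 0.1, 1.0],
              [1.0, 1.0, 0.1]])
y = np.array([5.0, 4.0, 5.0])          # ratings given by the user

theta = fit_user_parameters(X, y)
print("theta:", theta)
print("predicted ratings:", X @ theta)
```

Because of the regularization term, the predicted ratings land close to the given ratings rather than exactly on them. Setting lam to 0 removes that pull but makes this tiny problem converge much more slowly.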
# Load the TMDB 5000 credits and movies datasets
import numpy as np
import pandas as pd

credits = pd.read_csv('D:/Dataset/tmdb_5000_credits.csv')
movies_df = pd.read_csv('D:/Dataset/tmdb_5000_movies.csv')

# Preview the first rows of each table
credits.head()
movies_df.head()

print("Credits: ", credits.shape)
print("Movies Dataframe: ", movies_df.shape)


# Rename movie_id to id so that both tables share the same key column
credits_column_renamed = credits.rename(columns={"movie_id": "id"})

# Join the movie metadata with the credits on the movie id
movies_df_merge = movies_df.merge(credits_column_renamed, on='id')
movies_df_merge.head()

# Drop columns that are not needed for the recommender
movies_cleaned_df = movies_df_merge.drop(columns=['homepage', 'title_x', 'title_y', 'status', 'production_countries'])
movies_cleaned_df.head()

movies_cleaned_df.info()


# Ingredients of the weighted rating
v = movies_cleaned_df['vote_count']                  # number of votes for each movie
R = movies_cleaned_df['vote_average']                # average rating of each movie
C = movies_cleaned_df['vote_average'].mean()         # mean rating across all movies
m = movies_cleaned_df['vote_count'].quantile(0.70)   # vote-count threshold: the 70th percentile

movies_cleaned_df['vote_count'].sum()

# Weighted rating: W = (R*v + C*m) / (v + m)
movies_cleaned_df['weighted_average'] = ((R * v) + (C * m)) / (v + m)
movies_cleaned_df.head()

# Rank the movies by weighted rating, best first
movie_sorted_ranking = movies_cleaned_df.sort_values('weighted_average', ascending=False)
movie_sorted_ranking

movie_sorted_ranking[['original_title', 'vote_count', 'vote_average', 'weighted_average', 'popularity']].head(20)


import matplotlib.pyplot as plt
import seaborn as sns

weight_average = movie_sorted_ranking.sort_values('weighted_average', ascending=False)

# Bar chart of the 10 movies with the highest weighted average
plt.figure(figsize=(12, 6))
axis1 = sns.barplot(x=weight_average['weighted_average'].head(10), y=weight_average['original_title'].head(10), data=weight_average)
plt.xlim(4, 10)
plt.xlabel('Weighted Average Score', weight='bold')
plt.ylabel('Movie title', weight='bold')
plt.savefig('best_movies.png')

# Bar chart of the 10 movies with the highest value in the popularity column
popularity = movie_sorted_ranking.sort_values('popularity', ascending=False)
popularity

plt.figure(figsize=(12, 6))
ax = sns.barplot(x=popularity['popularity'].head(10), y=popularity['original_title'].head(10), data=popularity)
plt.title('Most popular by votes', weight='bold')
plt.xlabel('score of popularity', weight='bold')
plt.ylabel('Movie Title', weight='bold')
plt.savefig('best_popular_movies.png')


from sklearn.preprocessing import MinMaxScaler

# Scale both columns to the range [0, 1] so that they can be combined
scaling = MinMaxScaler()
movie_scaled_df = scaling.fit_transform(movies_cleaned_df[['weighted_average', 'popularity']])
movie_scaled_df

movie_normalized_df = pd.DataFrame(movie_scaled_df, columns=['weighted_average', 'popularity'])
movie_normalized_df.head()

# Attach the scaled columns to the main dataframe
movies_cleaned_df[['normalized_weight_average', 'normalized_popularity']] = movie_normalized_df
movies_cleaned_df.head()

# CONSIDER 50% WEIGHT TO POPULARITY AND 50% WEIGHT TO WEIGHTED_AVERAGE
movies_cleaned_df['score'] = movies_cleaned_df['normalized_weight_average'] * 0.5 + movies_cleaned_df['normalized_popularity'] * 0.5

# Rank the movies by the combined score, best first
movies_scored_df = movies_cleaned_df.sort_values(['score'], ascending=False)
movies_scored_df[['original_title', 'normalized_weight_average', 'normalized_popularity', 'score']].head(20)


# CONTENT BASED RECOMMENDATION SYSTEM
movies_cleaned_df.head(1)['overview']

from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF over 1- to 3-word phrases; terms found in fewer than 3 overviews are ignored
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3), stop_words='english')
tfv

# Replace missing overviews with empty strings so that the vectorizer can process them
movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')

# Sparse matrix with one row per movie and one column per retained term
tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
tfv_matrix.shape

from sklearn.metrics.pairwise import sigmoid_kernel

# Pairwise similarity between every pair of movie overviews
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)


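# REVERSE MAPPING OF INDICES AND MOVIES TITLE
indices = pd.Series(movies_cleaned_df.index, index=movies_cleaned_df['original_title'])
# Keep the first entry when a title occurs more than once, so that a lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]
indices

indices['Newlyweds']

# Similarity scores of the movie in row 4799 against every movie
sig[4799]

# Pair every movie index with its similarity to 'Newlyweds'
list(enumerate(sig[indices['Newlyweds']]))

# The same pairs, sorted from most to least similar
sorted(list(enumerate(sig[indices['Newlyweds']])), key=lambda x: x[1], reverse=True)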


def give_rec(title, sig=sig):
    # Row index of the requested movie
    idx = indices[title]
    # Pair every movie index with its similarity to the requested movie
    sig_scores = list(enumerate(sig[idx]))
    # Sort from most to least similar
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)
    # 10 best similar movies (the first entry is normally the movie itself, so skip it)
    sig_scores = sig_scores[1:11]
    movie_indices = [i[0] for i in sig_scores]
    return movies_cleaned_df['original_title'].iloc[movie_indices]


give_rec('Avatar')

give_rec('Titanic')

give_rec('Star Wars')

TF-IDF has been used here for processing the movie overviews. The TF*IDF algorithm weighs a keyword in a piece of content and assigns importance to that keyword based on the number of times it appears in the document. More importantly, it checks how common the keyword is across the whole collection of documents, which is referred to as the corpus: a keyword that appears in almost every document gets a low weight.

For a term t in a document d, the weight W(t,d) of term t in document d is given by

W(t,d) = TF(t,d) * log(N / DF(t))

where:

  1. TF(t,d) is the number of occurrences of t in document d.
  2. DF(t) is the number of documents containing the term t.
  3. N is the total number of documents in the corpus.
TF*IDF
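The formula W(t,d) = TF(t,d) * log(N / DF(t)) can be checked by hand on a tiny corpus. The sketch below uses only the Python standard library. The three sentences, the function name tf_idf_weights and the use of the natural logarithm are assumptions made for illustration. Note that scikit-learn's TfidfVectorizer by default applies a smoothed IDF and normalizes each row, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, with weight = TF * log(N / DF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # DF: number of documents that contain each term at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # TF: raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical mini corpus standing in for movie overviews
docs = [
    "the marine explores the alien moon",
    "the ship sinks on the voyage",
    "the alien ship attacks the station",
]
weights = tf_idf_weights(docs)

for term in ("the", "alien", "moon"):
    print(term, round(weights[0][term], 3))
```

In the first document, "the" occurs twice but appears in every document, so log(3/3) = 0 and its weight is 0. "alien" appears in two of the three documents and gets log(3/2), while "moon" appears in only one and gets the highest weight, log(3).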

THANK YOU…
