Saturday, 28 January 2017

RECOMMENDATION ENGINE - CONTENT-BASED FILTERING & COLLABORATIVE FILTERING

Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue.
The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of.
Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them.
However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow.
In the immortal words of Steve Jobs: “a lot of times, people don’t know what they want until you show it to them.”
The customer personalization journeys of Amazon and Netflix demonstrate just how powerful recommendation engines can be. See how these online giants built cutting-edge recommendation engines that keep subscribers coming back for more.

·        Amazon
·        Netflix
·        Google Image Search
·        A recommendation engine can engage audiences with the right content
·        A recommendation engine can customize ads or sponsored content for a user based on their preferences
·        A recommendation engine can surface relevant content for a publishing website
Types of recommendation models
Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent:
·        Content-based filtering
·        Collaborative filtering
Ø  Item-based collaborative filtering
Ø  User-based collaborative filtering
Content-based filtering
Assume a “real world” case: “John’s favourite cake is Napoleon. He went to a shop for it, but such cakes were sold out. John asked a marketer to recommend something similar and was recommended a Napoleon torte that has the same ingredients. John bought it.”

This is an example of pure content-based filtering in the real world: the marketer recommended the torte based on the similarity of its ingredients. A content-based filtering system has the same intuition behind it.
Content-based (CB) filtering systems recommend items similar to the items a user liked in the past.
Before we proceed, let me define a couple of terms:
  • Item refers to the content whose attributes are used in the recommender model. Items could be movies, documents, books, etc.
  • Attribute refers to a characteristic of an item. A movie’s tags or the words in a document are examples.
These systems rely on algorithms that assemble a user’s preferences into a user profile and item information into item profiles. They then recommend the items whose profiles are most similar to the user’s profile.
A user profile can be seen as a set of keywords (terms, features) collected by the algorithm from items the user found relevant (or interesting).
An item profile is a set of keywords (terms, features) describing the item itself.
The actual profile-building process is handled by various information retrieval or machine learning techniques. For instance, the most frequent terms in the document describing an item can represent the item’s profile.
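As a toy illustration in R (the description text below is invented), the most frequent terms in an item's description could form its profile:

# Build a tiny item profile from the most frequent terms in its description
description <- "chocolate cream pastry chocolate butter cream chocolate"
terms <- table(strsplit(description, " ")[[1]])
head(sort(terms, decreasing = TRUE), 3)   # the top terms act as the item's profile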
Now the example can be reformulated in recommender terms: John liked the Napoleon cake, and its ingredients formed John’s user profile. The system reviewed the other available item profiles and found that the most similar one was the “Napoleon torte” item profile. The similarity is high because the cake and the torte share the same ingredients, and this was the reason for the recommendation.
The principal advantage of the content-based filtering approach lies in its nature: it can start to recommend as soon as information about the items is available, and it does not need opinions from other users in order to recommend to a given user.
How do Content Based Recommender Systems work?
A content-based recommender works with data that the user provides, either explicitly (ratings) or implicitly (clicking on a link). Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more inputs or takes actions on the recommendations, the engine becomes more and more accurate.
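To make this concrete, below is a minimal sketch in base R (the items, attributes and rating are invented purely for illustration): the user profile is built as a rating-weighted sum of the attribute vectors of the items the user liked, and unrated items are then ranked by cosine similarity to that profile.

# Minimal content-based filtering sketch (hypothetical data for illustration)

# Item profiles: rows = items, columns = binary attributes (e.g. ingredients, genres)
items <- matrix(c(1, 1, 0, 1,    # item A
                  1, 1, 1, 0,    # item B
                  0, 0, 1, 1),   # item C
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"),
                                c("attr1", "attr2", "attr3", "attr4")))

# Explicit feedback from one user: the user liked item A (rating 5)
user_ratings <- c(A = 5)

# User profile = rating-weighted sum of the attribute vectors of the rated items
profile <- colSums(items[names(user_ratings), , drop = FALSE] * user_ratings)

# Cosine similarity between the user profile and each unrated item profile
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
unrated <- setdiff(rownames(items), names(user_ratings))
scores  <- sapply(unrated, function(i) cosine(profile, items[i, ]))

# Recommend the unrated item most similar to the user profile (here, item B)
sort(scores, decreasing = TRUE)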
Collaborative filtering
This is the Collaborative Filtering (CF) approach: recommendations come from other people who have shown similar tastes in the past and who have already experienced items still unknown to the current user.
Collaborative filtering systems require users to express opinions on items. They collect these opinions and recommend items based on the similarity of people’s opinions; the users who agree most with the current user contribute most to the recommendation.
Now the example can be reformulated again: John asked for a recommendation for a “best fit” drink. A collaborative filtering system reviewed the opinions of only those people who had tried and liked the Napoleon torte in the past. The recommended “mint tea” is simply the most highly rated item among these people.
Collaborative filtering systems usually review more than just one common item to define the set of users who influence the results. For example, John should have rated many different cakes, and other users must have rated the same cakes in the past, for the recommendation to improve (MovieLens requires at least 20 movies to be rated before it produces recommendations [movielens.org]).
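As a small sketch of the user-based variant in base R (all names and ratings below are invented for illustration): we measure how much each other user agrees with the current user on the items they have rated in common, and then predict the current user's rating for an unseen item from the ratings of the users who agree with him.

# Minimal user-based collaborative filtering sketch (hypothetical ratings)

# Rows = users, columns = items; NA = not yet rated
ratings <- rbind(
  john  = c(napoleon = 5, eclair = 4, brownie = 2, mint_tea = NA),
  alice = c(napoleon = 5, eclair = 4, brownie = 1, mint_tea = 5),
  bob   = c(napoleon = 1, eclair = 2, brownie = 5, mint_tea = 1)
)

# Pearson correlation over co-rated items measures how much two users agree
sim_to_john <- sapply(c("alice", "bob"), function(u) {
  ok <- !is.na(ratings["john", ]) & !is.na(ratings[u, ])
  cor(ratings["john", ok], ratings[u, ok])
})
sim_to_john   # alice is close to +1 (agrees with John), bob close to -1 (opposite taste)

# Predict John's rating for mint_tea from the neighbours who rated it,
# weighting each neighbour's rating by his/her (non-negative) similarity to John
w <- pmax(sim_to_john, 0)
sum(w * ratings[names(w), "mint_tea"]) / sum(w)   # close to 5, driven by alice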


Item-based collaborative filtering
Item-based collaborative filtering is a model-based algorithm for recommender engines. In item-based collaborative filtering, similarities between items are calculated from the rating matrix, and based upon these similarities the user’s preference for an item he has not rated is predicted. Here is a step-by-step worked example for four users and three items. We will consider the following sample data on the preferences of four users for three items:
ID     user   item   rating
241    u1     m1     2
222    u1     m3     3
276    u2     m1     5
273    u2     m2     2
200    u3     m1     3
229    u3     m2     3
231    u3     m3     1
239    u4     m2     2
286    u4     m3     2

Step 1: Write the user-item ratings data in matrix form. The above table gets rewritten as follows:

        m1    m2    m3
u1       2     -     3
u2       5     2     -
u3       3     3     1
u4       -     2     2

Here the rating of user u1 for item m3 is 3. There is no rating for item m2 by user u1, and no rating for item m3 by user u2.

Step 2: We will now create an item-to-item similarity matrix. The idea is to calculate how similar an item is to another item. There are a number of ways of calculating this; we will use the cosine similarity measure. To calculate the similarity between items m1 and m2, for example, look at all the users who have rated both items. In our case, both m1 and m2 have been rated by users u2 and u3. We create two item-vectors, v1 for item m1 and v2 for item m2, in the user-space of (u2, u3) and then find the cosine of the angle between these vectors. A zero angle (overlapping vectors) gives a cosine of 1 and means total similarity, i.e. every common user gave both items the same rating, while an angle of 90 degrees gives a cosine of 0 and means no similarity. Thus, the two item-vectors would be,

            v1 = 5 u2 + 3 u3
            v2 = 2 u2 + 3 u3

The cosine similarity between the two vectors, v1 and v2, would then be:

             cos (v1,v2) = (5*2 + 3*3)/sqrt[(25 + 9)*(4 + 9)] = 0.90

Similarly, to calculate similarity between m1 and m3, we consider only users u1 and u3 who have rated both these items. The two item vectors, v1 for item m1 and v3 for item m3, in the user-space would be as follows:

             v1 = 2 u1 + 3 u3
             v3 = 3 u1 + 1 u3

The cosine similarity measure between v1 and v3 is:
             cos (v1,v3) = (2*3 + 3*1)/sqrt[(4 + 9)*(9 + 1)] = 0.79

We can similarly calculate the similarity between items m2 and m3 using the ratings given to both by users u3 and u4. The two item-vectors v2 and v3 would be:

             v2 = 3 u3 + 2 u4
             v3 = 1 u3 + 2 u4

And cosine similarity between them is:

             cos (v2,v3) = (3*1 + 2*2)/sqrt[(9 + 4)*(1 + 4)] = 0.87

We now have the complete item-to-item similarity matrix:

        m1     m2     m3
m1     1.00   0.90   0.79
m2     0.90   1.00   0.87
m3     0.79   0.87   1.00

Step 3: For each user, we next predict his ratings for the items he has not rated. We will calculate the rating of user u1 for item m2 (the target item). To calculate this, we take the just-calculated similarity measures between the target item and the other items the user has already rated, and weight each one by the rating the user gave to that item. We then divide this weighted sum by the sum of the similarity measures so that the calculated rating remains within the predefined rating limits. Thus, the predicted rating for item m2 for user u1 is calculated using the similarity measures between (m2, m1) and (m2, m3), weighted by the respective ratings for m1 and m3:

                Rating = (2 * 0.90 + 3 * 0.87)/ (0.90 + 0.87) = 2.49
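The hand calculation above can be reproduced in a few lines of base R (a sketch only; the matrix below is simply the sample ratings table in matrix form, and the similarity function uses only the users who rated both items, as described in Step 2):

# Reproduce the worked item-based example in base R
rat <- rbind(u1 = c(m1 = 2,  m2 = NA, m3 = 3),
             u2 = c(m1 = 5,  m2 = 2,  m3 = NA),
             u3 = c(m1 = 3,  m2 = 3,  m3 = 1),
             u4 = c(m1 = NA, m2 = 2,  m3 = 2))

# Cosine similarity between two items, computed over users who rated both
item_cos <- function(i, j) {
  ok <- !is.na(rat[, i]) & !is.na(rat[, j])
  sum(rat[ok, i] * rat[ok, j]) / (sqrt(sum(rat[ok, i]^2)) * sqrt(sum(rat[ok, j]^2)))
}
item_cos("m1", "m2")   # ~0.90
item_cos("m1", "m3")   # ~0.79
item_cos("m2", "m3")   # ~0.87

# Predicted rating of u1 for m2: similarity-weighted average of u1's own ratings
s    <- c(item_cos("m2", "m1"), item_cos("m2", "m3"))
r_u1 <- rat["u1", c("m1", "m3")]
sum(s * r_u1) / sum(s)   # ~2.49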

A recommender engine using item-based collaborative filtering can be constructed with the R package recommenderlab, as in the script below.
############### Collaborative filtering in R (Recommendation Engine) ################

R Script (uses a training data set, train_v2.csv, and a test data set, test_v2.csv)

# Set data path as per your data file (for example: "c://abc//" )
setwd("F:/Data Science/Data Science/Ashish/Recommendation Engine Dataset")

# If not installed, first install following three packages in R
#install.packages("recommenderlab")
library(recommenderlab)
library(reshape2)
library(ggplot2)
# Read training file along with header
tr<-read.csv("train_v2.csv",header=TRUE)
# Just look at first few lines of this file
head(tr)
# Remove 'id' column. We do not need it
tr<-tr[,-c(1)]
# Check, if removed
tr[tr$user==1,]
# Using acast to convert above data as follows:
# m1  m2   m3   m4
# u1    3   4    2    5
# u2    1   6    5
# u3    4   4    2    5
g<-acast(tr, user ~ movie)
# Check the class of g
class(g)

# Convert it as a matrix
R<-as.matrix(g)

# Convert R into realRatingMatrix data structure
#   realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r

# view r in other possible ways
as(r, "list")     # A list
as(r, "matrix")   # A sparse matrix

# I can turn it into data-frame
head(as(r, "data.frame"))

# normalize the rating matrix
r_m <- normalize(r)
r_m
as(r_m, "list")

# Draw an image plot of raw-ratings & normalized ratings
#  A column represents one specific movie and ratings by users
#   are shaded.
#   Note that some items are always rated 'black' by most users
#    while some items are not rated by many users
#     On the other hand a few users always give high ratings
#      as in some cases a series of black dots cut across items
image(r, main = "Raw Ratings")      
image(r_m, main = "Normalized Ratings")

# Can also turn the matrix into a 0-1 binary matrix
r_b <- binarize(r, minRating=1)
as(r_b, "matrix")

# Create a recommender object (model)
#   Run any one of the following four code lines.
#     Do not run all four
#       They pertain to four different algorithms.
#        UBCF: User-based collaborative filtering
#        IBCF: Item-based collaborative filtering
#      Parameter 'method' decides similarity measure
#        Cosine or Jaccard
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Jaccard",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="IBCF", param=list(normalize = "Z-score",method="Jaccard",minRating=1))
rec=Recommender(r[1:nrow(r)],method="POPULAR")

# Depending upon your selection, examine what you got
print(rec)
names(getModel(rec))
getModel(rec)$nn

############Create predictions#############################
# This prediction does not predict movie ratings for test.
#   But it fills up the user 'X' item matrix so that
#    for any userid and movieid, I can find predicted rating
#     dim(r) shows there are 6040 users (rows)
#      'type' parameter decides whether you want ratings or top-n items
#         get top-10 recommendations for a user, as:
#             predict(rec, r[1:nrow(r)], type="topNList", n=10)
recom <- predict(rec, r[1:nrow(r)], type="ratings")
recom

########## Examination of model & experimentation #############
########## This section can be skipped #########################

# Convert prediction into list, user-wise
as(recom, "list")
# Study and Compare the following:
as(r, "matrix")[1:10,1:10]      # Has lots of NAs. 'r' is the original matrix
as(recom, "matrix") # Is full of ratings. NAs disappear
as(recom, "matrix")[1:10,1:10] # Show ratings for all users for items 1 to 10
as(recom, "matrix")[5,3]   # Rating for user 5 for item at index 3
as.integer(as(recom, "matrix")[5,3]) # Just get the integer value
as.integer(round(as(recom, "matrix")[6039,8])) # Round first, then get the integer value
as.integer(round(as(recom, "matrix")[368,3717]))

# Convert all your recommendations to list structure
rec_list<-as(recom,"list")
head(summary(rec_list))
# Access this list. User 2, item at index 2
rec_list[[2]][2]
rec_list[[1837]][4]
# Convert to data frame all recommendations for user 1
u1<-as.data.frame(rec_list[[1]])
attributes(u1)
class(u1)
head(u1)
# Create a column by name of id in data frame u1 and populate it with row names
u1$id<-row.names(u1)
# Check movie ratings are in column 1 of u1
u1
# Now access movie ratings in column 1 for u1
u1[u1$id==3952,]

########## Create submission File from model #######################
# Read test file
test<-read.csv("test_v2.csv",header=TRUE)
head(test)
# Get ratings list
rec_list<-as(recom,"list")
head(summary(rec_list))
ratings<-NULL
# For all lines in test file, one by one
for ( u in 1:length(test[,2]))
{
  # Read userid and movieid from columns 2 and 3 of test data
  userid <- test[u,2]
  movieid<-test[u,3]
 
  # Get as list & then convert to data frame all recommendations for user: userid
  u1<-as.data.frame(rec_list[[userid]])
  # Create a second column, 'id', in the data frame u1 and populate it with the row names
  # (the row names of u1 are the movie ids); we use the row.names() function
  u1$id<-row.names(u1)
  # Now access movie ratings in column 1 of u1
  x= u1[u1$id==movieid,1]
  # print(u)
  # print(length(x))
  # If no ratings were found, assign 0. You could also
  #   assign user-average
  if (length(x)==0)
  {
    ratings[u] <- 0
  }
  else
  {
    ratings[u] <-x
  }
 
}
length(ratings)
tx<-cbind(test[,1],round(ratings))
# Write to a csv file: submitfile.csv in your folder
write.table(tx,file="submitfile.csv",row.names=FALSE,col.names=FALSE,sep=',')


Monday, 9 January 2017

How Spark runs on clusters.

Spark Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

There are several useful things to note about this architecture:
1.  Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
2.  Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
3.  The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
4.  Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
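As a small illustration of this driver/executor model from R (a sketch only: it assumes Spark 2.x with the SparkR package available under SPARK_HOME, and the application name is made up), the sparkR.session() call below plays the role of the driver program connecting to a cluster manager, and operations on the distributed DataFrame run as tasks on the executors:

# Start a Spark session from R (assumes SPARK_HOME is set and SparkR is installed)
library(SparkR)

# Connect to a cluster manager. "local[2]" runs everything locally with 2 threads;
# master = "yarn", for example, would instead let YARN allocate the executors.
sparkR.session(master = "local[2]", appName = "ClusterDemo")

# Work is expressed against a distributed DataFrame; the driver schedules the
# resulting tasks and they run inside the executor processes.
df <- as.DataFrame(faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

# Shut the session (and its executors) down when the application finishes
sparkR.session.stop()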

Cluster Manager Types
The system currently supports three cluster managers:
·  Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
·  Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
·  Hadoop YARN – the resource manager in Hadoop 2.

Submitting Applications
Applications can be submitted to a cluster of any type using the spark-submit script. The application submission guide describes how to do this.

Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply go to http://<driver-node>:4040 in a web browser to access this UI. The monitoring guide also describes other monitoring options.

Job Scheduling
Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within applications (if multiple computations are happening on the same SparkContext). The job scheduling overview describes this in more detail.

Glossary
The following glossary summarizes terms you’ll see used to refer to cluster concepts:

Application: A user program built on Spark, consisting of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you’ll see this term used in the driver’s logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.