Saturday, 28 January 2017

RECOMMENDATION ENGINE - CONTENT-BASED FILTERING & COLLABORATIVE FILTERING

Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue.
The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of.
Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them.
However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow.
In the immortal words of Steve Jobs: “a lot of times, people don’t know what they want until you show it to them.”
The customer personalization journeys of Amazon and Netflix demonstrate just how powerful recommendation engines can be. See how these online giants built cutting-edge recommendation engines that keep subscribers coming back for more.

·        Amazon
·        Netflix
·        Google Image Search
·        A recommendation engine can engage audiences with the right content
·        A recommendation engine can customize ads or sponsored content for a user based on their preferences
·        A recommendation engine can surface relevant content for a publishing website
Types of recommendation models
Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent:
·        Content-based filtering
·        Collaborative filtering
Ø  Item-based collaborative filtering
Ø  User-based collaborative filtering
Content-based filtering
Assume a “real world” case: “John’s favourite cake is Napoleon. He went to a shop for it, but such cakes were sold out. John asked a marketer to recommend something similar and was recommended a Napoleon torte that has the same ingredients. John bought it.”

This is an example of pure content-based filtering in the real world: the marketer recommended the torte based on the similarity of its ingredients. A content-based filtering system has the same intuition behind it.
Content-based (CB) filtering systems recommend items similar to the items a user liked in the past.
Before we proceed, let me define a couple of terms:
  • Item refers to the content whose attributes are used in the recommender model. Items could be movies, documents, books, etc.
  • Attribute refers to a characteristic of an item. A movie’s tags or the words in a document are examples.
These systems rely on algorithms that assemble a user’s preferences into a user profile and item information into item profiles. They then recommend the items whose profiles are most similar to the user’s profile.
A user profile can be seen as a set of keywords (terms, features) collected by the algorithm from items the user found relevant (or interesting).
An item profile is a set of keywords (terms, features) describing the item itself.
The actual profile-building process is handled by various information retrieval or machine learning techniques. For instance, the most frequent terms in the document describing an item can represent the item’s profile.
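As a toy illustration in R (the description text below is invented), the most frequent terms in an item's description could form its profile:

# Build a tiny item profile from the most frequent terms in its description
description <- "chocolate cream pastry chocolate butter cream chocolate"
terms <- table(strsplit(description, " ")[[1]])
head(sort(terms, decreasing = TRUE), 3)   # the top terms act as the item's profile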
Now the example can be reformulated in recommender terms: John liked the Napoleon cake, and its ingredients formed John’s user profile. The system reviewed the other available item profiles and found that the most similar one was the “Napoleon torte” item profile. The similarity is high because the cake and the torte share the same ingredients, and this was the reason for the recommendation.
The principal advantage of the content-based filtering approach lies in its nature: it can start to recommend as soon as information about the items is available, and it does not need opinions from other users in order to recommend to a given user.
How do Content Based Recommender Systems work?
A content-based recommender works with data that the user provides, either explicitly (ratings) or implicitly (clicking on a link). Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more inputs or takes actions on the recommendations, the engine becomes more and more accurate.
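To make this concrete, below is a minimal sketch in base R (the items, attributes and rating are invented purely for illustration): the user profile is built as a rating-weighted sum of the attribute vectors of the items the user liked, and unrated items are then ranked by cosine similarity to that profile.

# Minimal content-based filtering sketch (hypothetical data for illustration)

# Item profiles: rows = items, columns = binary attributes (e.g. ingredients, genres)
items <- matrix(c(1, 1, 0, 1,    # item A
                  1, 1, 1, 0,    # item B
                  0, 0, 1, 1),   # item C
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"),
                                c("attr1", "attr2", "attr3", "attr4")))

# Explicit feedback from one user: the user liked item A (rating 5)
user_ratings <- c(A = 5)

# User profile = rating-weighted sum of the attribute vectors of the rated items
profile <- colSums(items[names(user_ratings), , drop = FALSE] * user_ratings)

# Cosine similarity between the user profile and each unrated item profile
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
unrated <- setdiff(rownames(items), names(user_ratings))
scores  <- sapply(unrated, function(i) cosine(profile, items[i, ]))

# Recommend the unrated item most similar to the user profile (here, item B)
sort(scores, decreasing = TRUE)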
Collaborative filtering
This is the Collaborative Filtering (CF) approach: recommendations come from other people who have shown similar tastes in the past and who have already experienced items still unknown to the current user.
Collaborative filtering systems require users to express opinions on items. They collect these opinions and recommend items based on the similarity of people’s opinions; the users who agree most with the current user contribute most to the recommendation.
Now the example can be reformulated again: John asked for a recommendation for a “best fit” drink. A collaborative filtering system reviewed the opinions of only those people who had tried and liked the Napoleon torte in the past. The recommended “mint tea” is simply the most highly rated item among these people.
Collaborative filtering systems usually review more than just one common item to define the set of users who influence the results. For example, John should have rated many different cakes, and other users must have rated the same cakes in the past, for the recommendation to improve (MovieLens requires at least 20 movies to be rated before it produces recommendations [movielens.org]).
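As a small sketch of the user-based variant in base R (all names and ratings below are invented for illustration): we measure how much each other user agrees with the current user on the items they have rated in common, and then predict the current user's rating for an unseen item from the ratings of the users who agree with him.

# Minimal user-based collaborative filtering sketch (hypothetical ratings)

# Rows = users, columns = items; NA = not yet rated
ratings <- rbind(
  john  = c(napoleon = 5, eclair = 4, brownie = 2, mint_tea = NA),
  alice = c(napoleon = 5, eclair = 4, brownie = 1, mint_tea = 5),
  bob   = c(napoleon = 1, eclair = 2, brownie = 5, mint_tea = 1)
)

# Pearson correlation over co-rated items measures how much two users agree
sim_to_john <- sapply(c("alice", "bob"), function(u) {
  ok <- !is.na(ratings["john", ]) & !is.na(ratings[u, ])
  cor(ratings["john", ok], ratings[u, ok])
})
sim_to_john   # alice is close to +1 (agrees with John), bob close to -1 (opposite taste)

# Predict John's rating for mint_tea from the neighbours who rated it,
# weighting each neighbour's rating by his/her (non-negative) similarity to John
w <- pmax(sim_to_john, 0)
sum(w * ratings[names(w), "mint_tea"]) / sum(w)   # close to 5, driven by alice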


Item-based collaborative filtering
Item-based collaborative filtering is a model-based algorithm for recommender engines. In item-based collaborative filtering, similarities between items are calculated from the rating matrix, and based upon these similarities the user’s preference for an item he has not rated is predicted. Here is a step-by-step worked example for four users and three items. We will consider the following sample data on the preferences of four users for three items:
ID     user   item   rating
241    u1     m1     2
222    u1     m3     3
276    u2     m1     5
273    u2     m2     2
200    u3     m1     3
229    u3     m2     3
231    u3     m3     1
239    u4     m2     2
286    u4     m3     2

Step 1: Write the user-item ratings data in matrix form. The above table gets rewritten as follows:

        m1    m2    m3
u1       2     -     3
u2       5     2     -
u3       3     3     1
u4       -     2     2

Here the rating of user u1 for item m3 is 3. There is no rating for item m2 by user u1, and no rating for item m3 by user u2.

Step 2: We will now create an item-to-item similarity matrix. The idea is to calculate how similar an item is to another item. There are a number of ways of calculating this; we will use the cosine similarity measure. To calculate the similarity between items m1 and m2, for example, look at all the users who have rated both items. In our case, both m1 and m2 have been rated by users u2 and u3. We create two item-vectors, v1 for item m1 and v2 for item m2, in the user-space of (u2, u3) and then find the cosine of the angle between these vectors. A zero angle (overlapping vectors) gives a cosine of 1 and means total similarity, i.e. every common user gave both items the same rating, while an angle of 90 degrees gives a cosine of 0 and means no similarity. Thus, the two item-vectors would be,

            v1 = 5 u2 + 3 u3
            v2 = 2 u2 + 3 u3

The cosine similarity between the two vectors, v1 and v2, would then be:

             cos (v1,v2) = (5*2 + 3*3)/sqrt[(25 + 9)*(4 + 9)] = 0.90

Similarly, to calculate similarity between m1 and m3, we consider only users u1 and u3 who have rated both these items. The two item vectors, v1 for item m1 and v3 for item m3, in the user-space would be as follows:

             v1 = 2 u1 + 3 u3
             v3 = 3 u1 + 1 u3

The cosine similarity measure between v1 and v3 is:
             cos (v1,v3) = (2*3 + 3*1)/sqrt[(4 + 9)*(9 + 1)] = 0.79

We can similarly calculate the similarity between items m2 and m3 using the ratings given to both by users u3 and u4. The two item-vectors v2 and v3 would be:

             v2 = 3 u3 + 2 u4
             v3 = 1 u3 + 2 u4

And cosine similarity between them is:

             cos (v2,v3) = (3*1 + 2*2)/sqrt[(9 + 4)*(1 + 4)] = 0.87

We now have the complete item-to-item similarity matrix:

        m1     m2     m3
m1     1.00   0.90   0.79
m2     0.90   1.00   0.87
m3     0.79   0.87   1.00

Step 3: For each user, we next predict his ratings for the items he has not rated. We will calculate the rating of user u1 for item m2 (the target item). To calculate this, we take the just-calculated similarity measures between the target item and the other items the user has already rated, and weight each one by the rating the user gave to that item. We then divide this weighted sum by the sum of the similarity measures so that the calculated rating remains within the predefined rating limits. Thus, the predicted rating for item m2 for user u1 is calculated using the similarity measures between (m2, m1) and (m2, m3), weighted by the respective ratings for m1 and m3:

                Rating = (2 * 0.90 + 3 * 0.87)/ (0.90 + 0.87) = 2.49
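The hand calculation above can be reproduced in a few lines of base R (a sketch only; the matrix below is simply the sample ratings table in matrix form, and the similarity function uses only the users who rated both items, as described in Step 2):

# Reproduce the worked item-based example in base R
rat <- rbind(u1 = c(m1 = 2,  m2 = NA, m3 = 3),
             u2 = c(m1 = 5,  m2 = 2,  m3 = NA),
             u3 = c(m1 = 3,  m2 = 3,  m3 = 1),
             u4 = c(m1 = NA, m2 = 2,  m3 = 2))

# Cosine similarity between two items, computed over users who rated both
item_cos <- function(i, j) {
  ok <- !is.na(rat[, i]) & !is.na(rat[, j])
  sum(rat[ok, i] * rat[ok, j]) / (sqrt(sum(rat[ok, i]^2)) * sqrt(sum(rat[ok, j]^2)))
}
item_cos("m1", "m2")   # ~0.90
item_cos("m1", "m3")   # ~0.79
item_cos("m2", "m3")   # ~0.87

# Predicted rating of u1 for m2: similarity-weighted average of u1's own ratings
s    <- c(item_cos("m2", "m1"), item_cos("m2", "m3"))
r_u1 <- rat["u1", c("m1", "m3")]
sum(s * r_u1) / sum(s)   # ~2.49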

A recommender engine using item-based collaborative filtering can be constructed with the R package recommenderlab, as in the script below.
############### Collaborative filtering in R (Recommendation Engine) ################

R Script (uses a training data set, train_v2.csv, and a test data set, test_v2.csv)

# Set data path as per your data file (for example: "c://abc//" )
setwd("F:/Data Science/Data Science/Ashish/Recommendation Engine Dataset")

# If not installed, first install following three packages in R
#install.packages("recommenderlab")
library(recommenderlab)
library(reshape2)
library(ggplot2)
# Read training file along with header
tr<-read.csv("train_v2.csv",header=TRUE)
# Just look at first few lines of this file
head(tr)
# Remove 'id' column. We do not need it
tr<-tr[,-c(1)]
# Check, if removed
tr[tr$user==1,]
# Using acast to convert above data as follows:
# m1  m2   m3   m4
# u1    3   4    2    5
# u2    1   6    5
# u3    4   4    2    5
g<-acast(tr, user ~ movie)
# Check the class of g
class(g)

# Convert it as a matrix
R<-as.matrix(g)

# Convert R into realRatingMatrix data structure
#   realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r

# view r in other possible ways
as(r, "list")     # A list
as(r, "matrix")   # A sparse matrix

# I can turn it into data-frame
head(as(r, "data.frame"))

# normalize the rating matrix
r_m <- normalize(r)
r_m
as(r_m, "list")

# Draw an image plot of raw-ratings & normalized ratings
#  A column represents one specific movie and ratings by users
#   are shaded.
#   Note that some items are always rated 'black' by most users
#    while some items are not rated by many users
#     On the other hand a few users always give high ratings
#      as in some cases a series of black dots cut across items
image(r, main = "Raw Ratings")      
image(r_m, main = "Normalized Ratings")

# Can also turn the matrix into a 0-1 binary matrix
r_b <- binarize(r, minRating=1)
as(r_b, "matrix")

# Create a recommender object (model)
#   Run any one of the following four code lines.
#     Do not run all four
#       They pertain to four different algorithms.
#        UBCF: User-based collaborative filtering
#        IBCF: Item-based collaborative filtering
#      Parameter 'method' decides similarity measure
#        Cosine or Jaccard
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Jaccard",nn=5, minRating=1))
rec=Recommender(r[1:nrow(r)],method="IBCF", param=list(normalize = "Z-score",method="Jaccard",minRating=1))
rec=Recommender(r[1:nrow(r)],method="POPULAR")

# Depending upon your selection, examine what you got
print(rec)
names(getModel(rec))
getModel(rec)$nn

############Create predictions#############################
# This prediction does not predict movie ratings for test.
#   But it fills up the user 'X' item matrix so that
#    for any userid and movieid, I can find predicted rating
#     dim(r) shows there are 6040 users (rows)
#      'type' parameter decides whether you want ratings or top-n items
#         get top-10 recommendations for a user, as:
#             predict(rec, r[1:nrow(r)], type="topNList", n=10)
recom <- predict(rec, r[1:nrow(r)], type="ratings")
recom

########## Examination of model & experimentation #############
########## This section can be skipped #########################

# Convert prediction into list, user-wise
as(recom, "list")
# Study and Compare the following:
as(r, "matrix")[1:10,1:10]      # Has lots of NAs. 'r' is the original matrix
as(recom, "matrix") # Is full of ratings. NAs disappear
as(recom, "matrix")[1:10,1:10] # Show ratings for all users for items 1 to 10
as(recom, "matrix")[5,3]   # Rating for user 5 for item at index 3
as.integer(as(recom, "matrix")[5,3]) # Just get the integer value
as.integer(round(as(recom, "matrix")[6039,8])) # Round first, then get the integer value
as.integer(round(as(recom, "matrix")[368,3717]))

# Convert all your recommendations to list structure
rec_list<-as(recom,"list")
head(summary(rec_list))
# Access this list. User 2, item at index 2
rec_list[[2]][2]
rec_list[[1837]][4]
# Convert to data frame all recommendations for user 1
u1<-as.data.frame(rec_list[[1]])
attributes(u1)
class(u1)
head(u1)
# Create a column by name of id in data frame u1 and populate it with row names
u1$id<-row.names(u1)
# Check movie ratings are in column 1 of u1
u1
# Now access movie ratings in column 1 for u1
u1[u1$id==3952,]

########## Create submission File from model #######################
# Read test file
test<-read.csv("test_v2.csv",header=TRUE)
head(test)
# Get ratings list
rec_list<-as(recom,"list")
head(summary(rec_list))
ratings<-NULL
# For all lines in test file, one by one
for ( u in 1:length(test[,2]))
{
  # Read userid and movieid from columns 2 and 3 of test data
  userid <- test[u,2]
  movieid<-test[u,3]
 
  # Get as list & then convert to data frame all recommendations for user: userid
  u1<-as.data.frame(rec_list[[userid]])
  # Create a second column, 'id', in the data frame u1 and populate it with the row names
  # (the row names of u1 are the movie ids); we use the row.names() function
  u1$id<-row.names(u1)
  # Now access movie ratings in column 1 of u1
  x= u1[u1$id==movieid,1]
  # print(u)
  # print(length(x))
  # If no ratings were found, assign 0. You could also
  #   assign user-average
  if (length(x)==0)
  {
    ratings[u] <- 0
  }
  else
  {
    ratings[u] <-x
  }
 
}
length(ratings)
tx<-cbind(test[,1],round(ratings))
# Write to a csv file: submitfile.csv in your folder
write.table(tx,file="submitfile.csv",row.names=FALSE,col.names=FALSE,sep=',')


Monday, 9 January 2017

How Spark runs on clusters.

Spark Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

There are several useful things to note about this architecture:
1.  Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
2.  Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
3.  The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
4.  Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
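As a small illustration of this driver/executor model from R (a sketch only: it assumes Spark 2.x with the SparkR package available under SPARK_HOME, and the application name is made up), the sparkR.session() call below plays the role of the driver program connecting to a cluster manager, and operations on the distributed DataFrame run as tasks on the executors:

# Start a Spark session from R (assumes SPARK_HOME is set and SparkR is installed)
library(SparkR)

# Connect to a cluster manager. "local[2]" runs everything locally with 2 threads;
# master = "yarn", for example, would instead let YARN allocate the executors.
sparkR.session(master = "local[2]", appName = "ClusterDemo")

# Work is expressed against a distributed DataFrame; the driver schedules the
# resulting tasks and they run inside the executor processes.
df <- as.DataFrame(faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

# Shut the session (and its executors) down when the application finishes
sparkR.session.stop()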

Cluster Manager Types
The system currently supports three cluster managers:
·  Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
·  Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
·  Hadoop YARN – the resource manager in Hadoop 2.

Submitting Applications
Applications can be submitted to a cluster of any type using the spark-submit script. The application submission guide describes how to do this.

Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply go to http://<driver-node>:4040 in a web browser to access this UI. The monitoring guide also describes other monitoring options.

Job Scheduling
Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within applications (if multiple computations are happening on the same SparkContext). The job scheduling overview describes this in more detail.

Glossary
The following glossary summarizes terms you’ll see used to refer to cluster concepts:

Application: A user program built on Spark, consisting of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you’ll see this term used in the driver’s logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.