Posted to commits@mahout.apache.org by co...@apache.org on 2010/10/14 22:00:00 UTC

[CONF] Apache Mahout > Itembased Collaborative Filtering

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Itembased Collaborative Filtering (https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering)

Added by Sebastian Schelter:
---------------------------------------------------------------------
Itembased Collaborative Filtering is a popular way of doing Recommendation Mining.

h3. Terminology

We have *users* that interact with *items* (which can be pretty much anything: books, videos, news, other users, ...). Those users express *preferences* towards the items, which can be either boolean (just modelling that a user likes an item) or numeric (by having a rating value assigned to the preference). Typically only a small number of preferences are known for each user.

h3. Algorithmic problems

Collaborative Filtering algorithms aim to solve the *prediction* problem, where the task is to estimate the preference of a user towards an item which he/she has not yet seen.

Once an algorithm can predict preferences, it can also be used to do *Top-N-Recommendation*, where the task is to find the N items a given user might like best. This is usually done by isolating a set of candidate items, computing the predicted preference of the given user towards each of them and returning the highest scoring ones.
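
The Top-N procedure described above can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the function names {{top_n}} and {{toy_predict}} and the toy scores are assumptions made up for this example, and the candidate set is simply taken to be every item the user has no preference for yet.

```python
import heapq

def top_n(user_prefs, all_items, predict, n):
    """Return the n highest-scoring (item, score) pairs for one user."""
    # Candidate items: everything the user has not expressed a preference for.
    candidates = [item for item in all_items if item not in user_prefs]
    # Predict the user's preference towards every candidate.
    scored = [(item, predict(user_prefs, item)) for item in candidates]
    # Keep only the n candidates with the highest predicted preference.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Hypothetical predictor used only for this example: a fixed score table.
toy_scores = {"book": 4.5, "video": 2.0, "news": 3.5}

def toy_predict(user_prefs, item):
    return toy_scores[item]

prefs = {"album": 5.0}
recs = top_n(prefs, ["album", "book", "video", "news"], toy_predict, 2)
print(recs)
```

Any real predictor can be plugged in for {{toy_predict}}; the selection step stays the same.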

If we look at the problem from a mathematical perspective, a *user-item-matrix* is created from the preference data and the task is to predict the missing entries by finding patterns in the known entries.
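
As a rough illustration of this view, the following Python sketch builds a sparse user-item-matrix from preference lines of the form _userID,itemID,value_ and lists the entries that are still missing. The helper names and the sample data are assumptions for this example only; a nested dictionary stands in for the matrix.

```python
def build_matrix(lines):
    """Parse userID,itemID,value lines into matrix[user][item] = value."""
    matrix = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        user, item, value = line.split(",")
        matrix.setdefault(user, {})[item] = float(value)
    return matrix

def missing_entries(matrix):
    """List the (user, item) cells that have no known preference."""
    items = sorted({item for prefs in matrix.values() for item in prefs})
    return [(user, item)
            for user in sorted(matrix)
            for item in items
            if item not in matrix[user]]

# Made-up sample preferences.
sample = ["1,101,5.0", "1,102,3.0", "2,101,4.0", "2,103,2.0"]
m = build_matrix(sample)
print(missing_entries(m))
```

The cells returned by {{missing_entries}} are exactly the ones a prediction algorithm has to fill in.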

h3. Itembased Collaborative Filtering

A popular approach called "Itembased Collaborative Filtering" estimates a user's preference towards an item by looking at his/her preferences towards similar items. Be aware that in this context similarity means similarity of rating behaviour, not similarity of content.

The standard procedure is to compare the columns of the user-item-matrix (the item-vectors) pairwise using a similarity measure like Pearson correlation, cosine or loglikelihood to obtain similar items, and to use those together with the user's ratings to predict his/her preference towards unknown items.
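
A minimal single-machine sketch of this procedure in Python follows. It is not Mahout's distributed implementation: it assumes cosine similarity with missing ratings treated as zero, and it predicts with a similarity-weighted average of the user's own ratings, which is one common variant. All names and the toy ratings are made up for this example.

```python
from math import sqrt

def item_vectors(prefs):
    """Turn prefs[user][item] = rating into columns[item][user] = rating."""
    columns = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            columns.setdefault(item, {})[user] = rating
    return columns

def cosine(a, b):
    """Cosine similarity of two item-vectors, missing ratings counted as 0."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(prefs, user, item):
    """Similarity-weighted average of the user's ratings of other items."""
    columns = item_vectors(prefs)
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, rating in prefs[user].items():
        if other == item:
            continue
        sim = cosine(columns[item], columns[other])
        weighted_sum += sim * rating
        similarity_sum += abs(sim)
    if similarity_sum == 0.0:
        return None  # no similar item rated by this user
    return weighted_sum / similarity_sum

# Toy ratings: u1 has not rated item C yet.
prefs = {
    "u1": {"A": 5.0, "B": 4.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 1.0},
    "u3": {"B": 1.0, "C": 5.0},
}
estimate = predict(prefs, "u1", "C")
print(estimate)
```

Because the similarities here are non-negative, the estimate always lies between the lowest and the highest rating the user has given to the compared items.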


h3. Map/Reduce implementations

Mahout offers two Map/Reduce jobs that support Itembased Collaborative Filtering.

*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes all similar items. It expects a .csv file with the preference data as input, where each line represents a single preference in the form _userID,itemID,value_ and outputs pairs of itemIDs with their associated similarity value.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
  --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem      try to cap the number of similar items per item to this
                                                            number (default: 100)
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --booleanData (-b) booleanData                            Treat input as without pref values
{code}

*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely distributed itembased recommender. It expects a .csv file with the preference data as input, where each line represents a single preference in the form _userID,itemID,value_ and outputs userIDs with associated recommended itemIDs and their scores.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --numRecommendations (-n) numRecommendations              Number of recommendations per user
  --usersFile (-u) usersFile                                File of users to recommend for
  --itemsFile (-i) itemsFile                                File of items to recommend for
  --filterFile (-f) filterFile                              File containing comma-separated userID,itemID pairs. Used to
                                                            exclude the item from the recommendations for that user
                                                            (optional)
  --booleanData (-b) booleanData                            Treat input as without pref values
  --maxPrefsPerUser maxPrefsPerUser                         Maximum number of preferences considered per user in final
                                                            recommendation phase
  --maxSimilaritiesPerItem maxSimilaritiesPerItem           Maximum number of similarities considered per item
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
{code}

TODO: add more details

h3. Resources

* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf]
* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout|http://www.slideshare.net/sscdotopen/mahoutcf]
