Posted to commits@mahout.apache.org by co...@apache.org on 2010/10/14 22:00:00 UTC
[CONF] Apache Mahout > Itembased Collaborative Filtering
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Itembased Collaborative Filtering (https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering)
Added by Sebastian Schelter:
---------------------------------------------------------------------
Itembased Collaborative Filtering is a popular way of doing Recommendation Mining.
h3. Terminology
We have *users* that interact with *items* (which can be pretty much anything: books, videos, news, other users, ...). Those users express *preferences* towards the items, which can be either boolean (just modelling that a user likes an item) or numeric (a rating value assigned to the preference). Typically only a small number of preferences is known for each user.
h3. Algorithmic problems
Collaborative Filtering algorithms aim to solve the *prediction* problem, where the task is to estimate the preference of a user towards an item which he/she has not yet seen (see Sarwar et al., "Item-Based Collaborative Filtering Recommendation Algorithms", listed under Resources).
Once an algorithm can predict preferences, it can also be used to do *Top-N-Recommendation*, where the task is to find the N items a given user might like best. This is usually done by isolating a set of candidate items, computing the predicted preference of the given user towards them and returning the highest scoring ones.
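The Top-N-Recommendation procedure described above can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the function name {{top_n}}, the toy score table and the idea of passing the prediction in as a callable are all assumptions made for this example.

```python
import heapq

def top_n(user_prefs, candidates, predict, n):
    """Return the n candidate items with the highest predicted preference."""
    # Candidate selection: skip items the user already has a preference for.
    unseen = [item for item in candidates if item not in user_prefs]
    # Compute the predicted preference for each remaining candidate.
    scored = [(predict(item), item) for item in unseen]
    # Keep only the n highest scoring (score, item) pairs.
    return [(item, score) for score, item in heapq.nlargest(n, scored)]

# Hypothetical predicted scores standing in for a real prediction function.
toy_scores = {"book": 4.5, "video": 3.0, "news": 4.8, "album": 2.0}
recs = top_n({"album": 5.0}, toy_scores.keys(), toy_scores.get, 2)
print(recs)
```

Any prediction method can be plugged in as {{predict}}; the ranking step stays the same.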
If we look at the problem from a mathematical perspective, a *user-item-matrix* is created from the preference data and the task is to predict the missing entries by finding patterns in the known entries.
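As a small illustration of this view, the following Python sketch builds a sparse user-item-matrix from a handful of made-up preferences and lists the entries that are missing. The names {{build_matrix}} and {{prefs}} and the sample IDs are assumptions for this example only.

```python
from collections import defaultdict

def build_matrix(preferences):
    """Build a sparse user-item-matrix as {user: {item: value}}."""
    matrix = defaultdict(dict)
    for user, item, value in preferences:
        matrix[user][item] = value
    return matrix

# Made-up (userID, itemID, value) preferences.
prefs = [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.0), (2, 103, 2.0)]
matrix = build_matrix(prefs)

# The missing entries are the (user, item) cells a recommender has to predict.
items = sorted({item for _, item, _ in prefs})
missing = [(u, i) for u in sorted(matrix) for i in items if i not in matrix[u]]
print(missing)
```

Storing only the known entries keeps the structure small, since typically very few preferences are known per user.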
h3. Itembased Collaborative Filtering
A popular approach called "Itembased Collaborative Filtering" estimates a user's preference towards an item by looking at his/her preferences towards similar items. Be aware that in this context similarity means similarity of rating behaviour, not similarity of content.
The standard procedure is to compare the columns of the user-item-matrix (the item-vectors) pairwise, using a similarity measure like Pearson correlation, cosine or loglikelihood, to find similar items. Those similar items are then used together with a user's ratings to predict his/her preference towards unknown items.
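The procedure above can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not Mahout's implementation: the cosine is computed only over users who rated both items (one of several possible variants), the prediction is a similarity-weighted average of the user's own ratings, and all names and sample ratings are made up.

```python
import math

def item_vectors(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}} (the matrix columns)."""
    cols = {}
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            cols.setdefault(item, {})[user] = value
    return cols

def cosine(a, b):
    """Cosine similarity of two item-vectors over the users who rated both."""
    common = a.keys() & b.keys()
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap (or all-zero ratings): treat as not similar
    return dot / (norm_a * norm_b)

def predict(ratings, cols, user, target):
    """Similarity-weighted average of the user's ratings for other items."""
    num = den = 0.0
    for item, rating in ratings[user].items():
        sim = cosine(cols[target], cols[item])
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

# Made-up ratings: user "u1" has not rated item "C" yet.
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 4.0},
    "u3": {"A": 1.0, "C": 5.0},
}
cols = item_vectors(ratings)
predicted = predict(ratings, cols, "u1", "C")
print(predicted)
```

Because the prediction is a weighted average with non-negative weights here, it always lies between the lowest and highest rating the user has given.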
h3. Map/Reduce implementations
Mahout offers two Map/Reduce jobs that support Itembased Collaborative Filtering.
*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes all similar items. It expects a .csv file with the preference data as input, where each line represents a single preference in the form _userID,itemID,value_ and outputs pairs of itemIDs with their associated similarity value.
{code}
--input (-i) input                                      Path to job input directory.
--output (-o) output                                    The directory pathname for output.
--similarityClassname (-s) similarityClassname          Name of distributed similarity class to instantiate,
                                                        alternatively use one of the predefined similarities
                                                        (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                        SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                        SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                        SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number of similar items per item to this
                                                        number (default: 100)
--maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem  try to cap the number of cooccurrences per item to this
                                                        number (default: 100)
--booleanData (-b) booleanData                          Treat input as without pref values
{code}
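To make the input and output formats concrete, here is a small in-memory Python sketch that reads lines of the form _userID,itemID,value_, counts how often two items occur together in a user's preferences, and keeps at most a fixed number of similar items per item. It only mimics the described behaviour on toy data. It is not how the distributed job is implemented, and the function name {{cooccurrence_pairs}} and its {{max_per_item}} parameter are assumptions for this example.

```python
import csv
import io
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_pairs(csv_text, max_per_item=100):
    """Map each itemID to its most frequently co-occurring itemIDs."""
    items_by_user = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        user_id, item_id = row[0], row[1]  # the value is ignored for co-occurrence
        items_by_user[user_id].add(item_id)

    # Count, for every pair of items, the users who have a preference for both.
    counts = Counter()
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1

    # Collect the neighbours of each item and cap them at max_per_item.
    neighbours = defaultdict(list)
    for (a, b), count in counts.items():
        neighbours[a].append((b, count))
        neighbours[b].append((a, count))
    return {
        item: sorted(others, key=lambda pair: (-pair[1], pair[0]))[:max_per_item]
        for item, others in neighbours.items()
    }

# Made-up preference data in the userID,itemID,value format.
sample = """
1,101,5.0
1,102,3.0
2,101,4.0
2,102,1.0
2,103,2.0
"""
similar = cooccurrence_pairs(sample, max_per_item=1)
print(similar)
```

Capping the list per item keeps the output size bounded, which is the stated purpose of the maxSimilaritiesPerItem option.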
*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely distributed itembased recommender. It expects a .csv file with the preference data as input, where each line represents a single preference in the form _userID,itemID,value_ and outputs userIDs with associated recommended itemIDs and their scores.
{code}
--input (-i) input                                      Path to job input directory.
--output (-o) output                                    The directory pathname for output.
--numRecommendations (-n) numRecommendations            Number of recommendations per user
--usersFile (-u) usersFile                              File of users to recommend for
--itemsFile (-i) itemsFile                              File of items to recommend for
--filterFile (-f) filterFile                            File containing comma-separated userID,itemID pairs. Used to
                                                        exclude the item from the recommendations for that user
                                                        (optional)
--booleanData (-b) booleanData                          Treat input as without pref values
--maxPrefsPerUser maxPrefsPerUser                       Maximum number of preferences considered per user in final
                                                        recommendation phase
--maxSimilaritiesPerItem maxSimilaritiesPerItem         Maximum number of similarities considered per item
--maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem  try to cap the number of cooccurrences per item to this
                                                        number (default: 100)
--similarityClassname (-s) similarityClassname          Name of distributed similarity class to instantiate,
                                                        alternatively use one of the predefined similarities
                                                        (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                        SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                        SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                        SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
{code}
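The effect of the filterFile and numRecommendations options, as described in the help text above, can be illustrated with a short Python sketch. How the job applies the filter internally is not covered here; the function names, the in-memory data layout and the sample IDs are assumptions for this example.

```python
def parse_filter(text):
    """Read comma-separated userID,itemID pairs into a set."""
    excluded = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            user_id, item_id = line.split(",")
            excluded.add((user_id.strip(), item_id.strip()))
    return excluded

def apply_filter(recommendations, excluded, n):
    """Drop excluded (user, item) pairs, then keep at most n items per user."""
    return {
        user: [(item, score) for item, score in recs
               if (user, item) not in excluded][:n]
        for user, recs in recommendations.items()
    }

# Made-up recommendations, already sorted by descending score.
recommendations = {
    "1": [("103", 4.2), ("104", 3.9), ("105", 3.1)],
    "2": [("104", 4.0)],
}
excluded = parse_filter("1,103\n2,104\n")
filtered = apply_filter(recommendations, excluded, 2)
print(filtered)
```

Filtering before truncating to n matters: otherwise an excluded item could take up one of the n slots and the user would receive fewer recommendations than requested.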
TODO: add more details
h3. Resources
* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf]
* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout|http://www.slideshare.net/sscdotopen/mahoutcf]