Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2011/03/12 12:37:59 UTC

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006033#comment-13006033 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Attached a new version of the patch. I'd like to commit this one in the next few days, if there are no objections (and no errors found). This patch removes some parts of the code that were highly memory-intensive and hopefully enables tests with a higher number of features. It introduces a set of tools that might enable a first real-world usage of this algorithm:

* DatasetSplitter: split a rating dataset into training and probe parts
* ParallelALSFactorizationJob: parallel ALS-WR factorization of a rating matrix
* PredictionJob: predict preferences using the factorization of a rating matrix
* InMemoryFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes in memory
* ParallelFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes

There are still open points, in particular how to find a good regularization parameter automatically and efficiently, and how to create an automated recommender pipeline similar to that of RecommenderJob on top of these tools. But I think these issues can be tackled in the future.
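Until automatic selection exists, the usual workaround is a hold-out sweep: factorize the training set once per candidate lambda and keep the value with the lowest probe RMSE. A minimal sketch of that selection step in Python (the candidate grid and the probe_rmse callback are illustrative assumptions, not part of the patch):

```python
def pick_lambda(candidates, probe_rmse):
    """Pick the regularization value with the lowest probe-set RMSE.

    probe_rmse stands in for one full factorize-and-evaluate run
    (parallelALS on the training set, then evaluateFactorizationParallel
    against the probe set) -- here it is just a callback.
    """
    return min(candidates, key=probe_rmse)

# hypothetical stand-in error curve with its minimum at 0.065
best = pick_lambda([0.02, 0.065, 0.1, 0.25], lambda lam: abs(lam - 0.065))
```

In practice each callback invocation is expensive (a full ALS run), so the grid is usually coarse and refined around the best candidate.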

Here's how to play with the code:

{noformat}
# convert the movielens 1M dataset to mahout's common format for ratings
cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv

# create a 90% training set and a 10% probe set
bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset --trainingPercentage 0.9 --probePercentage 0.1

# run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# compute predictions against the probe set, measure the error
bin/mahout evaluateFactorizationParallel --output /tmp/als/rmse --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/

# print the error
cat /tmp/als/rmse/rmse.txt 
0.8531723318490103

# alternatively you can use the factorization to predict unknown ratings
bin/mahout predictFromFactorization --output /tmp/als/predict --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ --tempDir /tmp/als/predictTmp

# look at the predictions
cat /tmp/als/predict/part-r-*
1,150,4.0842405867880975
1,1029,4.163510579205656
1,745,3.7759166479388777
1,2294,3.495085673991081
1,938,3.6820865362790594
2,2067,3.8303249557251644
2,1090,3.954322089979675
2,1196,3.912089186677311
2,498,2.820740198815573
2,593,4.090550572202017
...
{noformat}
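For intuition, the alternating update that parallelALS distributes can be sketched for the degenerate single-feature case (a toy illustration in plain Python, not the Mahout code; the dataset and initial values are made up):

```python
# Toy single-feature ALS-WR. The real job uses numFeatures latent factors
# and solves a small linear system per user/item row; with one feature
# that system collapses to the scalar division below.

ratings = {  # (user, item) -> rating, a tiny made-up dataset
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 5.0,
}
users = sorted({usr for usr, _ in ratings})
items = sorted({itm for _, itm in ratings})

lam = 0.065                        # weighted-lambda regularization
u = {usr: 0.1 for usr in users}    # user features, arbitrary init
m = {itm: 0.1 for itm in items}    # item features, arbitrary init

def rmse():
    se = sum((r - u[usr] * m[itm]) ** 2 for (usr, itm), r in ratings.items())
    return (se / len(ratings)) ** 0.5

before = rmse()
for _ in range(20):
    # fix M, recompute each user feature; the lam * n_ui term is the
    # "weighted" part of the regularization (n_ui = #ratings of user i)
    for usr in users:
        rated = [(itm, r) for (uu, itm), r in ratings.items() if uu == usr]
        num = sum(m[itm] * r for itm, r in rated)
        den = sum(m[itm] ** 2 for itm, _ in rated) + lam * len(rated)
        u[usr] = num / den
    # fix U, recompute each item feature symmetrically
    for itm in items:
        rated = [(usr, r) for (usr, ii), r in ratings.items() if ii == itm]
        num = sum(u[usr] * r for usr, r in rated)
        den = sum(u[usr] ** 2 for usr, _ in rated) + lam * len(rated)
        m[itm] = num / den
after = rmse()

# a predicted rating is just the product of the two feature values
# (the dot product of the feature vectors in the multi-feature case)
prediction = u[0] * m[0]
```

Each half-step only needs one factor matrix fixed, which is what makes the per-row solves independent and therefore easy to spread over mappers.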

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip
>
>
> As Mahout is currently lacking a distributed collaborative filtering algorithm that uses matrix factorization, I spent some time reading through a couple of the Netflix papers and stumbled upon the "Large-scale Parallel Collaborative Filtering for the Netflix Prize" available at http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with Weighted-λ-Regularization" to factorize the preference-matrix and gives some insights on how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using Map/Reduce, so I sat down and created a prototype version. I'm not really sure I got the mathematical details correct (they need some optimization anyway), but I want to put up my prototype implementation here, per Yonik's law of patches.
> Maybe someone has the time and motivation to work a little on this with me. It would be great if someone could validate the approach taken (I'm willing to help, as the code might not be intuitive to read), try to factorize some test data, and then give feedback.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira