Posted to dev@mahout.apache.org by "Danny Bickson (JIRA)" <ji...@apache.org> on 2011/02/08 00:48:57 UTC

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991718#comment-12991718 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Hi,
Everything works now with the new patch (542-5). With the MovieLens 1M data everything works fine; I have tested with one, two, and four slaves.
With the Netflix data, I get the following exception:

2011-02-04 19:42:45,613 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_0: Error: GC overhead limit exceeded
2011-02-04 19:42:45,614 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_0' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,617 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,618 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_0' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:48,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_1: Error: GC overhead limit exceeded
2011-02-04 21:10:48,030 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_1' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_2: Error: GC overhead limit exceeded
2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,342 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_3' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,355 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_2' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'


Any ideas about how to fix this?
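
In case it helps anyone reproducing this: the repeated "GC overhead limit exceeded" in the reduce attempts suggests the reducer is running out of heap, so one setting worth experimenting with (a sketch only, not verified against this job) is raising the per-task child JVM heap in mapred-site.xml:

```xml
<!-- mapred-site.xml: raise the per-task JVM heap.
     The 2048m value is an example; tune it to the memory available on the slaves. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

The same option can also be passed per job on the command line via -Dmapred.child.java.opts=-Xmx2048m without editing the cluster config.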

Thanks!!

Danny Bickson

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch
>
>
> As Mahout is currently lacking a distributed collaborative filtering algorithm that uses matrix factorization, I spent some time reading through a couple of the Netflix papers and stumbled upon the "Large-scale Parallel Collaborative Filtering for the Netflix Prize" available at http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with Weighted-λ-Regularization" to factorize the preference-matrix and gives some insights on how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using Map/Reduce, so I sat down and created a prototype version. I'm not really sure I got the mathematical details correct (they need some optimization anyway), but I want to put up my prototype implementation here per Yonik's law of patches.
> Maybe someone has the time and motivation to work on this with me a little. It would be great if someone could validate the approach taken (I'm willing to help, as the code might not be intuitive to read), try to factorize some test data, and then give feedback.
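
For reference, the weighted-λ-regularized objective that ALS-WR minimizes in the Zhou et al. paper linked above is roughly the following (symbol names are mine: u_i are user factor vectors, m_j item factor vectors, r_ij observed ratings, and n_{u_i}, n_{m_j} the rating counts weighting the regularizer):

```latex
\min_{U,M} \sum_{(i,j) \in I} \left( r_{ij} - u_i^{\top} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2
                 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
```

Alternating over U and M turns each step into independent per-user (or per-item) regularized least-squares solves, which is what makes the algorithm easy to distribute.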

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira