Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/11/13 23:17:13 UTC

[jira] Created: (MAHOUT-542) MapReduce implementation of ALS-WR

MapReduce implementation of ALS-WR
----------------------------------

                 Key: MAHOUT-542
                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
    Affects Versions: 0.5
            Reporter: Sebastian Schelter


As Mahout currently lacks a distributed collaborative filtering algorithm that uses matrix factorization, I spent some time reading through a couple of the Netflix papers and stumbled upon "Large-scale Parallel Collaborative Filtering for the Netflix Prize", available at http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.

It describes a parallel algorithm that uses "Alternating-Least-Squares with Weighted-λ-Regularization" to factorize the preference matrix and gives some insight into how the authors distributed the computation using Matlab.
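For reference, the objective and per-user update from the paper can be written as follows (a LaTeX sketch following the paper's notation, where E is the k x k identity, n_{u_i} the number of ratings of user i, I_i the set of items rated by user i, and M_{I_i} the sub-matrix of M restricted to those items):

\begin{align*}
f(U, M) &= \sum_{(i,j) \in I} \left( r_{ij} - \mathbf{u}_i^{\top} \mathbf{m}_j \right)^2
  + \lambda \Big( \sum_i n_{u_i} \lVert \mathbf{u}_i \rVert^2
  + \sum_j n_{m_j} \lVert \mathbf{m}_j \rVert^2 \Big) \\
\mathbf{u}_i &= \left( M_{I_i} M_{I_i}^{\top} + \lambda \, n_{u_i} E \right)^{-1} M_{I_i} R^{\top}(i, I_i)
\end{align*}

Alternating this update with the analogous one for the item vectors is what makes each half-step embarrassingly parallel: every user (or item) can be solved independently.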

It seemed to me that this approach could also easily be parallelized using Map/Reduce, so I sat down and created a prototype version. I'm not really sure I got the mathematical details correct (they need some optimization anyway), but I want to put up my prototype implementation here per Yonik's law of patches.

Maybe someone has the time and motivation to work on this with me a little. It would be great if someone could validate the approach taken (I'm willing to help, as the code might not be intuitive to read), try to factorize some test data, and then give feedback.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Lance Norskog <go...@gmail.com>.
Pragati-

Please start a new thread with this question.

On 7/23/11, pragati meena (JIRA) <ji...@apache.org> wrote:
>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070018#comment-13070018
> ]
>
> pragati meena commented on MAHOUT-542:
> --------------------------------------
>
> Hi Sebastian,
>
> I am trying to run the example on Windows using Hadoop on Cygwin, but I
> keep getting the following error, even though the history file exists at
> the same directory location:
>
> Exception in thread "main" java.lang.IllegalStateException:
> java.io.FileNotFoundException: File does not exist:
> /user/hadoop/temp/errors/_logs
> at
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:73)
> at
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:67)
> at com.google.common.collect.Iterators$8.next(Iterators.java:730)
> at com.google.common.collect.Iterators$5.hasNext(Iterators.java:508)
> at
> com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:40)
> at
> org.apache.mahout.utils.eval.ParallelFactorizationEvaluator.computeRmse(ParallelFactorizationEvaluator.java:111)
>
> Any ideas on how to fix this?
>
> Regards,
>
> Pragati Meena


-- 
Lance Norskog
goksron@gmail.com
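For what it's worth, this failure typically means the iterator tried to open Hadoop's _logs bookkeeping directory as a SequenceFile. A minimal sketch of a filter that skips such entries, assuming the sequence-file listing can be given a Hadoop PathFilter (Mahout's PathFilters class provides similar ready-made filters):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch: accept only real data files, skipping Hadoop bookkeeping
// entries such as _logs/ and _SUCCESS that are not SequenceFiles.
public final class SkipBookkeepingFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}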

Re: [jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jul 19, 2011 at 1:07 AM, Sebastian Schelter <ss...@apache.org> wrote:

> On 19.07.2011 01:32, Ted Dunning wrote:
>
>> Well... it makes it uncomputable in explicit form.  Sometimes there are
>> implicit forms for the matrix that keep the size in bounds.  For
>> instance,
>> limited rank decompositions (aka truncated singular value decompositions)
>> can be represented by storing two skinny matrices and a diagonal that
>> don't
>> take much more memory/disk than the original data.
>>
>
> Aren't the user/item feature matrices already a form of this?
>

Yes.


> I think the basic question was how to compute recommendations for a
> particular user. You could just predict his preference for all items he has
> not yet seen by multiplying the item features matrix with his user feature
> vector to get the estimated preferences, but you cannot do this for all users
> at once, right?
>

If you trim the recommendations to the top N, you probably can do this
computation.


> So the question would be how to find the initial "candidate items" from the
> item feature matrix.
>

Precisely put.  You can do this ahead of time and save vats of work, or just
do the multiplication, which will take a while, but you only need to
keep one dense vector at a time.
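A minimal sketch of that trimming idea (hypothetical helper; plain arrays stand in for the factor matrices): score every item for one user on the fly and keep only a bounded min-heap of the N best, so nothing larger than the heap and the user's feature vector is ever held.

import java.util.PriorityQueue;

public final class TopNSketch {

  // Returns the indices of the n highest-scoring items for one user.
  // itemFeatures is numItems x k, userFeatures has length k.
  public static int[] topN(double[][] itemFeatures, double[] userFeatures, int n) {
    // min-heap on score: the head is the weakest of the current top n
    PriorityQueue<double[]> heap =
        new PriorityQueue<double[]>(n, (a, b) -> Double.compare(a[0], b[0]));
    for (int item = 0; item < itemFeatures.length; item++) {
      double score = 0.0;
      for (int f = 0; f < userFeatures.length; f++) {
        score += itemFeatures[item][f] * userFeatures[f]; // u . m_item
      }
      if (heap.size() < n) {
        heap.offer(new double[] { score, item });
      } else if (score > heap.peek()[0]) {
        heap.poll(); // drop the weakest candidate
        heap.offer(new double[] { score, item });
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[1]; // poll ascending, fill from the back
    }
    return result;
  }
}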

Re: [jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Sebastian Schelter <ss...@apache.org>.
On 19.07.2011 01:32, Ted Dunning wrote:
> Well... it makes it uncomputable in explicit form.  Sometimes there are
> implicit forms for the matrix that keep the size in bounds.  For instance,
> limited rank decompositions (aka truncated singular value decompositions)
> can be represented by storing two skinny matrices and a diagonal that don't
> take much more memory/disk than the original data.

Aren't the user/item feature matrices already a form of this?

I think the basic question was how to compute recommendations for a 
particular user. You could just predict his preference for all items he 
has not yet seen by multiplying the item features matrix with his user 
feature vector to get the estimated preferences, but you cannot do this 
for all users at once, right?

So the question would be how to find the initial "candidate items" from 
the item feature matrix.

Hope I didn't misunderstand this.

--sebastian
>
> On Mon, Jul 18, 2011 at 12:17 AM, Sebastian Schelter (JIRA)<jira@apache.org
>> wrote:
>
>> The problem with this naive approach is that the resulting matrix is going
>> to be huge (millions of users times hundreds of thousands of items) and dense,
>> which makes it uncomputable.
>


Re: [jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Ted Dunning <te...@gmail.com>.
Well... it makes it uncomputable in explicit form.  Sometimes there are
implicit forms for the matrix that keep the size in bounds.  For instance,
limited rank decompositions (aka truncated singular value decompositions)
can be represented by storing two skinny matrices and a diagonal that don't
take much more memory/disk than the original data.

On Mon, Jul 18, 2011 at 12:17 AM, Sebastian Schelter (JIRA) <jira@apache.org
> wrote:

> The problem with this naive approach is that the resulting matrix is going
> to be huge (millions of users times hundreds of thousands of items) and dense,
> which makes it uncomputable.
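For a sense of scale: with, say, 1,000,000 users, 100,000 items, and rank k = 50, the explicit product would have 10^6 * 10^5 = 10^11 dense entries, while the two skinny factors plus the diagonal take only about 10^6*50 + 10^5*50 + 50, roughly 5.5 * 10^7 values, the same order of magnitude as the raw ratings themselves.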

Re: [jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Ted Dunning <te...@gmail.com>.
With a line search like this, the evolutionary algorithm should be
reasonably frugal in terms of evaluations, but the function being optimized
is almost certainly friendly enough for a line search.

The only caveat is that the evolutionary algorithm will do better in the
presence of noise.  The line search can get fooled once and never recover.
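A minimal sketch of such a handful-of-experiments search (hypothetical: trainAndScore stands in for one full factorize-then-holdout-RMSE run, and the candidate values are made up):

import java.util.function.DoubleUnaryOperator;

public final class LambdaSearchSketch {

  // Runs a few experiments and returns the lambda with the lowest
  // holdout RMSE. trainAndScore(lambda) is a stand-in for "factorize
  // with this lambda, then compute RMSE on held-out ratings".
  public static double pickLambda(DoubleUnaryOperator trainAndScore) {
    double[] candidates = { 0.02, 0.05, 0.1, 0.2 }; // the "3-4 experiments"
    double bestLambda = candidates[0];
    double bestRmse = Double.POSITIVE_INFINITY;
    for (double lambda : candidates) {
      double rmse = trainAndScore.applyAsDouble(lambda);
      if (rmse < bestRmse) {
        bestRmse = rmse;
        bestLambda = lambda;
      }
    }
    return bestLambda;
  }
}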

On Tue, Dec 21, 2010 at 9:00 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> There's an evolutionary algorithm in SGD to find those adaptively using
> cross-validation, but it may be too demanding in terms of the number of
> experiments. Just FYI.
>
> On Mon, Dec 20, 2010 at 11:58 PM, Sebastian Schelter <
> ssc.open@googlemail.com> wrote:
>
> > Hi Dmitriy,
> >
> > the paper states that it's easy to find a good lambda value with 3-4
> > experiments. I still have to verify that assumption on a real dataset.
> >
> > --sebastian
> >
> >
> > On 21.12.2010 00:57, Dmitriy Lyubimov wrote:
> >
> >>  Hi Sebastian,
> >>
> >> how do you come up with a good Lambda to use with this weighted ALS?
> >>
> >> On Mon, Dec 20, 2010 at 3:27 PM, Sebastian Schelter (JIRA)
> >> <ji...@apache.org>wrote:
> >>
> >>       [
> >>>
> >>>
> https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >>> ]
> >>>
> >>> Sebastian Schelter updated MAHOUT-542:
> >>> --------------------------------------
> >>>
> >>>     Attachment: MAHOUT-542-2.patch
> >>>
> >>> An updated version of the patch. I fixed a small bug, added more tests
> >>> and
> >>> polished the code a little.
> >>>
> >>> The distributed matrix factorization works fine now on a toy example.
> The
> >>> next steps will be to use real data and do some holdout tests.

Re: [jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
There's an evolutionary algorithm in SGD to find those adaptively using
cross-validation, but it may be too demanding in terms of the number of
experiments. Just FYI.

On Mon, Dec 20, 2010 at 11:58 PM, Sebastian Schelter <
ssc.open@googlemail.com> wrote:

> Hi Dmitriy,
>
> the paper states that it's easy to find a good lambda value with 3-4
> experiments. I still have to verify that assumption on a real dataset.
>
> --sebastian
>
>
> On 21.12.2010 00:57, Dmitriy Lyubimov wrote:
>
>>  Hi Sebastian,
>>
>> how do you come up with a good Lambda to use with this weighted ALS?
>>
>> On Mon, Dec 20, 2010 at 3:27 PM, Sebastian Schelter (JIRA)
>> <ji...@apache.org>wrote:
>>
>>       [
>>>
>>> https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>> ]
>>>
>>> Sebastian Schelter updated MAHOUT-542:
>>> --------------------------------------
>>>
>>>     Attachment: MAHOUT-542-2.patch
>>>
>>> An updated version of the patch. I fixed a small bug, added more tests
>>> and
>>> polished the code a little.
>>>
>>> The distributed matrix factorization works fine now on a toy example. The
>>> next steps will be to use real data and do some holdout tests.

Re: [jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Dmitriy,

the paper states that it's easy to find a good lambda value with 3-4 
experiments. I still have to verify that assumption on a real dataset.

--sebastian

On 21.12.2010 00:57, Dmitriy Lyubimov wrote:
> Hi Sebastian,
>
> how do you come up with a good Lambda to use with this weighted ALS?
>
> On Mon, Dec 20, 2010 at 3:27 PM, Sebastian Schelter (JIRA)
> <ji...@apache.org>wrote:
>
>>      [
>> https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Sebastian Schelter updated MAHOUT-542:
>> --------------------------------------
>>
>>      Attachment: MAHOUT-542-2.patch
>>
>> An updated version of the patch. I fixed a small bug, added more tests and
>> polished the code a little.
>>
>> The distributed matrix factorization works fine now on a toy example. The
>> next steps will be to use real data and do some holdout tests.


Re: [jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hi Sebastian,

how do you come up with a good Lambda to use with this weighted ALS?

On Mon, Dec 20, 2010 at 3:27 PM, Sebastian Schelter (JIRA)
<ji...@apache.org>wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sebastian Schelter updated MAHOUT-542:
> --------------------------------------
>
>     Attachment: MAHOUT-542-2.patch
>
> An updated version of the patch. I fixed a small bug, added more tests and
> polished the code a little.
>
> The distributed matrix factorization works fine now on a toy example. The
> next steps will be to use real data and do some holdout tests.

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988069#action_12988069 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Hi again,
When I ran the tests the previous time (after removing AlternateLeastSquareSolver), I got the test error described in my previous post (line 94 of the test org.apache.mahout.math.als.AlternateLeastSquaresSolverTest): fail() was called because the expected exception was not thrown.

Did you make changes in the patch to address this problem?

Thanks!

DB
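For context, judging from the stack trace, line 94 of that test uses the standard JUnit 4 expected-exception pattern. A self-contained sketch of the pattern (Integer.parseInt is a stand-in for the createRiIiMaybeTransposed call that is supposed to reject non-sequential-access vectors):

import static org.junit.Assert.fail;

import org.junit.Test;

public class ExpectedExceptionPatternTest {

  @Test
  public void rejectsBadInput() {
    try {
      Integer.parseInt("not a number"); // stand-in for the solver call
      fail(); // reached only if the expected exception was NOT thrown
    } catch (NumberFormatException expected) {
      // expected path: the call must reject invalid input
    }
  }
}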



[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991718#comment-12991718 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Hi,
Everything works now with the new patch (542-5). With the MovieLens 1M data everything works fine; I have tested with one, two, and four slaves.
With the Netflix data, I get the following error:

2011-02-04 19:42:45,613 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_0: Error: GC overhead limit exceeded
2011-02-04 19:42:45,614 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_0' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,617 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,618 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_0' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:48,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_1: Error: GC overhead limit exceeded
2011-02-04 21:10:48,030 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_1' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_2: Error: GC overhead limit exceeded
2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,342 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_3' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,355 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_2' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'


Any ideas about how to fix this?

Thanks!!

Danny Bickson
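For reference, "GC overhead limit exceeded" usually means the reduce task's JVM heap is too small for data at Netflix scale. A common mitigation (a sketch, assuming 0.20-era Hadoop configuration keys; the heap size is just an example) is to raise the child JVM heap before submitting the job:

import org.apache.hadoop.conf.Configuration;

public final class HeapConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Give each map/reduce child JVM more heap; the right value
    // depends on the task tracker machines.
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    // ... submit the factorization job with this conf ...
  }
}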


[jira] Issue Comment Edited: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986825#action_12986825 ] 

Danny Bickson edited comment on MAHOUT-542 at 1/25/11 10:01 PM:
----------------------------------------------------------------

Hi!
I tried to apply patch MAHOUT-542-3 against a clean svn checkout of Mahout trunk.
It seems that the file AlternateLeastSquaresSolver.java had already been committed to svn, so the code got duplicated.
I tried to merge the changes manually, and I attach the patch I created (named unifiedpatch).

The code compiles, but some tests fail.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.mahout.math.jet.random.NormalTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.437 sec
Running org.apache.mahout.math.TestMatrixView
Tests run: 49, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec
Running org.apache.mahout.math.TestDenseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.134 sec
Running org.apache.mahout.math.TestSparseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.315 sec
Running org.apache.mahout.math.jet.stat.ProbabilityTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec
Running org.apache.mahout.math.VectorListTest
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.291 sec
Running org.apache.mahout.math.stats.LogLikelihoodTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.11 sec
Running org.apache.mahout.math.jet.random.engine.MersenneTwisterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.049 sec
Running org.apache.mahout.math.decomposer.hebbian.TestHebbianSolver
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.071 sec
Running org.apache.mahout.math.TestDenseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.ExponentialTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.707 sec
Running org.apache.mahout.math.jet.stat.GammaTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.193 sec
Running org.apache.mahout.math.stats.OnlineSummarizerTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec
Running org.apache.mahout.math.VectorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.048 sec
Running org.apache.mahout.common.RandomUtilsTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.apache.mahout.math.TestSparseColumnMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.104 sec
Running org.apache.mahout.math.TestSequentialAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.05 sec
Running org.apache.mahout.math.QRDecompositionTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.045 sec
Running org.apache.mahout.math.TestSparseRowMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.09 sec
Running org.apache.mahout.math.TestRandomAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec
Running org.apache.mahout.math.TestOrderedIntDoubleMapping
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
Running org.apache.mahout.math.jet.random.NegativeBinomialTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.mahout.math.TestVectorView
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.GammaTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.72 sec
Running org.apache.mahout.math.TestSingularValueDecomposition
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.mahout.math.decomposer.lanczos.TestLanczosSolver
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.875 sec

Results :

Failed tests:

createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest)

ubuntu@ip-10-195-226-63:/usr/local/mahout-0.4$ cat math/target/surefire-reports/org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
-------------------------------------------------------------------------------
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest) Time elapsed: 0.002 sec  <<< FAILURE!
java.lang.AssertionError:
       at org.junit.Assert.fail(Assert.java:91)
       at org.junit.Assert.fail(Assert.java:98)
       at org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.createRiIiMaybeTransposedExceptionOnNonSequentialVector(AlternateLeastSquaresSolverTest.java:94)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
       at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
       at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
       at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
       at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)


Can you please take a look and let me know what to do?

Thanks,

Danny Bickson




[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976238#action_12976238 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

{quote}
Can you compute factorizations for multiple values of lambda in one go and then evaluate all of them in one pass?

This would require that parallelALS accept a list of lambdas and produce multiple outputs.
It would also require that evaluateALS accept multiple models as well.
{quote}

I don't think this could work with the distributed implementation, as each iteration needs to see the feature vectors that were computed in the previous iteration with the same lambda value, so I don't see how to concurrently compute the values for several lambdas. My evaluation code currently depends on the feature matrices fitting into memory, which might not always be the case; that could be another bottleneck.

However, in a non-distributed implementation this approach could work. Would it help to pick a sample from the input data, find a near-optimal lambda on it in a non-distributed way, and use that for the distributed computation too? I'm not sure about this.

Another issue I'm currently stuck on is how to compute the recommendations after the factorization. The rating matrix is factorized into the feature matrices U and M for users and items, and a single rating can easily be predicted by multiplying the feature vectors of the user and the item. If we want to compute batch recommendations for all users, we need to find an intelligent way to select "candidate items" to recommend for each user, as we can't simply compute t(U) * M because those are dense matrices.

Our non-distributed SVDRecommender uses simple cooccurrence to identify those candidate items, so one way I could think of would be to use RowSimilarityJob to find cooccurring items and, for each user, to only compute his ratings for items that cooccur with preferred ones. I'm not sure either whether this is the best way to do it.
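A sketch of that per-user step (hypothetical names; candidateItems stands in for whatever a RowSimilarityJob-style cooccurrence computation produces):

import java.util.HashMap;
import java.util.Map;

public final class CandidateScoringSketch {

  // Predicted rating = dot product of user and item feature vectors.
  static double predict(double[] userFeatures, double[] itemFeatures) {
    double sum = 0.0;
    for (int f = 0; f < userFeatures.length; f++) {
      sum += userFeatures[f] * itemFeatures[f];
    }
    return sum;
  }

  // Score only the candidate items instead of the full dense t(U) * M.
  static Map<Integer, Double> scoreCandidates(double[] userFeatures,
      double[][] itemFeatures, int[] candidateItems) {
    Map<Integer, Double> scores = new HashMap<Integer, Double>();
    for (int item : candidateItems) {
      scores.put(item, predict(userFeatures, itemFeatures[item]));
    }
    return scores;
  }
}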



[jira] [Work started] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-542 started by Sebastian Schelter.


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994874#comment-12994874 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Wow, that's really awesome news. Do you plan on doing more experiments?


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008464#comment-13008464 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

I have found a way to make the code do the factorization for multiple lambdas at once by making the intermediate writables carry an extra feature vector for each extra lambda. How do we proceed now: do we try to automatically find a near-optimal lambda, or do we leave it up to the user to experiment a little?
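A sketch of what such an intermediate value could look like (hypothetical class; one feature vector per lambda candidate, serialized via Mahout's VectorWritable):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: carries one feature vector per lambda candidate so a single
// pass can update the factorizations for all lambdas at once.
public final class FeaturesPerLambdaWritable implements Writable {

  private VectorWritable[] features; // features[l] belongs to lambda l

  public FeaturesPerLambdaWritable() {}

  public FeaturesPerLambdaWritable(Vector[] vectors) {
    features = new VectorWritable[vectors.length];
    for (int l = 0; l < vectors.length; l++) {
      features[l] = new VectorWritable(vectors[l]);
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(features.length);
    for (VectorWritable v : features) {
      v.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    features = new VectorWritable[in.readInt()];
    for (int l = 0; l < features.length; l++) {
      features[l] = new VectorWritable();
      features[l].readFields(in);
    }
  }

  public Vector featuresForLambda(int l) {
    return features[l].get();
  }
}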


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010530#comment-13010530 ] 

Hudson commented on MAHOUT-542:
-------------------------------

Integrated in Mahout-Quality #690 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/690/])
    MAHOUT-542 MapReduce implementation of ALS-WR



[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150471#comment-13150471 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

It's an interesting idea. Please open a new JIRA issue for it, as this one is already closed and was only meant to cover the initial implementation.
                

[jira] [Updated] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Comment: was deleted

(was: The problem with this naive approach is that the resulting matrix is going to be huge (millions of users times hundreds of thousands of items) and dense, which makes it infeasible to compute.

I'm not aware of a general approach to computing recommendations from matrix decompositions; in the scientific literature these are only used for measuring the prediction error on held-out data (as far as I know))


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018803#comment-13018803 ] 

Hudson commented on MAHOUT-542:
-------------------------------

Integrated in Mahout-Quality #742 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/742/])
    MAHOUT-542 missing evaluation classes



[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Fabian Alenius (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107458#comment-13107458 ] 

Fabian Alenius commented on MAHOUT-542:
---------------------------------------

Okay. I'll wait a bit to see if anyone objects.

I'm going to implement [Collaborative Filtering for Implicit Feedback Datasets|http://research.yahoo.com/pub/2433], and my plan is to try to reuse as much of the ParallelAlsFactorization code as possible. Do you see any issues with this?
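
For context, the user-side update in that paper has the same regularized linear-system shape as the ALS-WR one, which is what makes sharing the solver machinery plausible. A rough sketch of that update under this assumption, using Mahout's math API; all names and the data layout are illustrative, not taken from any patch:

import java.util.Map;

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

public class ImplicitFeedbackUpdateSketch {

  // User update from Hu, Koren and Volinsky:
  //   x_u = (Y'Y + Y'(C_u - I)Y + lambda * I)^-1  Y' C_u p(u)
  // Only the items the user interacted with contribute to the (C_u - I)
  // correction, so Y'Y can be precomputed once per iteration and shared.
  static Matrix userUpdate(Matrix YtransposeY,                // f x f, precomputed
                           Map<Integer,Vector> itemFeatures,  // itemID -> feature vector
                           Map<Integer,Double> interactions,  // itemID -> r_ui
                           double alpha, double lambda, int f) {
    Matrix A = new DenseMatrix(f, f).plus(YtransposeY); // working copy of Y'Y
    Matrix b = new DenseMatrix(f, 1);
    for (Map.Entry<Integer,Double> entry : interactions.entrySet()) {
      Vector y = itemFeatures.get(entry.getKey());
      double confidence = 1.0 + alpha * entry.getValue(); // c_ui = 1 + alpha * r_ui
      for (int i = 0; i < f; i++) {
        // right-hand side Y' C_u p(u); the preference p(u) is 1 for observed items
        b.set(i, 0, b.get(i, 0) + confidence * y.get(i));
        for (int j = 0; j < f; j++) {
          // add (c_ui - 1) * y y' to Y'Y
          A.set(i, j, A.get(i, j) + (confidence - 1.0) * y.get(i) * y.get(j));
        }
      }
    }
    for (int i = 0; i < f; i++) {
      A.set(i, i, A.get(i, i) + lambda); // regularization
    }
    return new QRDecomposition(A).solve(b); // f x 1 user feature vector
  }
}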


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931787#action_12931787 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Hi Ted,

I read through that paper a while ago when we exchanged ideas for Mahout 0.5 on the mailing list, and to be honest I didn't really get the mathematical details. Nevertheless, I understood that its possibilities are far superior to what we currently have, and also to the approach described in the Netflix paper mentioned above, mainly because of the ability to handle side information and nominal values, as you already mentioned. I don't think the paper describes a parallelization approach for the algorithm, though I'm not sure whether one is even necessary.

But I had the prototype code attached in this patch ready before we had that discussion, and I hope it can be finished with only a little input from someone else, so I decided to put it up here. I'd have no problem dropping this issue, though, if MAHOUT-525 is done and it turns out that putting work in there would give Mahout a much nicer recommender.


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Bickson updated MAHOUT-542:
---------------------------------

    Attachment:     (was: unifiedpatch)



[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010033#comment-13010033 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

I think so too. We should commit this now and then iterate on automatic lambda detection and optimization. Are you OK with this, Ted?


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Bickson updated MAHOUT-542:
---------------------------------

    Attachment: MAHOUT-542-4.patch

Hi everyone,
Issue solved. I've created a new patch and tested it on the MovieLens 1M dataset. On an Amazon EC2 machine (large instance) it takes 1998 seconds,
with an accuracy of RMSE: 0.8546120366924382, MAE: 0.6798083002225481.

It seems that 3 files were already included in the patch while also being in SVN.
Another issue was line 94 of math.als.AlternateLeastSquaresSolverTest.java, which calls fail(). This killed the unit testing. Commenting it out and running the rest of the tests did not seem to affect the results.

Please take a look into this.

Best,


Danny Bickson
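
For reference: RMSE penalizes large errors quadratically while MAE weighs all errors equally, which is why the two numbers differ. A minimal self-contained sketch of how both measures are computed from held-out (actual, estimated) rating pairs; the Prediction holder is a hypothetical name, not a class from the patch:

import java.util.Arrays;
import java.util.List;

public class ErrorMetricsSketch {

  // Hypothetical holder for one held-out rating and the model's estimate.
  static final class Prediction {
    final double actual;
    final double estimated;
    Prediction(double actual, double estimated) {
      this.actual = actual;
      this.estimated = estimated;
    }
  }

  // Root of the mean squared error over all pairs.
  static double rmse(List<Prediction> predictions) {
    double sumOfSquaredErrors = 0;
    for (Prediction p : predictions) {
      double error = p.estimated - p.actual;
      sumOfSquaredErrors += error * error;
    }
    return Math.sqrt(sumOfSquaredErrors / predictions.size());
  }

  // Mean of the absolute errors over all pairs.
  static double mae(List<Prediction> predictions) {
    double sumOfAbsoluteErrors = 0;
    for (Prediction p : predictions) {
      sumOfAbsoluteErrors += Math.abs(p.estimated - p.actual);
    }
    return sumOfAbsoluteErrors / predictions.size();
  }

  public static void main(String[] args) {
    List<Prediction> heldOut = Arrays.asList(
        new Prediction(4.0, 3.5), new Prediction(2.0, 2.5));
    System.out.println("RMSE: " + rmse(heldOut) + ", MAE: " + mae(heldOut));
  }
}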




[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987175#action_12987175 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Thanks Sebastian!

I have access to some large clusters; once you are done I will happily help you test the current patch on the Netflix or other larger datasets.

Best, 

- Danny



[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988091#action_12988091 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Did you apply the patch against the current trunk, without any of the other patches attached here? I did this on my local machine, ran the tests, and everything worked fine.



[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979265#action_12979265 ] 

Lance Norskog commented on MAHOUT-542:
--------------------------------------

> the factorization needs a regularization parameter called lambda, which heavily influences the quality of the result.

If you distribute the matrix portions and run the lambda calculator in each job separately, how closely do the resulting lambdas match? If they correlate well, would it work to have a bunch of similar lambdas?



[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-542-5.patch

Updated the patch to work with the current trunk. I only had to remove AlternateLeastSquaresSolver, as this class had already been committed.

btw: the call to fail() in the unit test is valid; it's used to make a test fail when an exception that you expect to be thrown is not actually thrown.
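
For anyone unfamiliar with the idiom, a minimal self-contained illustration; parseStrictly is a hypothetical stand-in for the solver call in the actual test:

import static org.junit.Assert.fail;

import org.junit.Test;

public class ExpectedExceptionPatternTest {

  @Test
  public void failsUnlessExceptionIsThrown() {
    try {
      parseStrictly("not-a-number");            // expected to throw
      fail("expected a NumberFormatException"); // only reached if nothing was thrown
    } catch (NumberFormatException e) {
      // expected: catching the exception lets the test pass
    }
  }

  private int parseStrictly(String s) {
    return Integer.parseInt(s);
  }
}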


It would be great if you could use the current version of this patch to run the factorization on the Netflix dataset with the parameters described in the paper. I'd love to see whether we get roughly the same results.



[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-542-2.patch

An updated version of the patch. I fixed a small bug, added more tests and polished the code a little.

The distributed matrix factorization now works fine on a toy example. The next steps will be to use real data and do some hold-out tests.



[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009633#comment-13009633 ] 

Sean Owen commented on MAHOUT-542:
----------------------------------

In terms of this issue, it sounds like you've made enough progress that it's worth committing so people can start to use it. Sounds like it works and you're just thinking about future optimizations.


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066826#comment-13066826 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

The problem with this naive approach is that the resulting matrix is going to be huge (millions of users times hundreds of thousands of items) and dense, which makes it infeasible to compute: 5 million users times 500,000 items would already be 2.5 trillion entries.

I'm not aware of a general approach to computing recommendations from matrix decompositions; in the scientific literature these are only used for measuring the prediction error on held-out data (as far as I know).


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994875#comment-12994875 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Yes, I am doing some more experiments, because we are comparing our system, GraphLab
(http://www.graphlab.ml.cmu.edu/), to Mahout.

I am not sure if this is the right forum to post performance updates, but I will gladly
update anyone interested - email me...


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107508#comment-13107508 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Go for it.  The Hadoop API is very confused at the moment in any case.

Use whatever you like from 0.20.204.


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931758#action_12931758 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Sebastian, 

Nice idea. Can you say how it relates to Elkan and Menon's paper at http://arxiv.org/abs/1006.2156?

The core of their algorithm is a matrix factorization, but they seem to get away with very few latent factors per user and are able to handle side information very nicely.

So it seems on the surface that these methods should be related.  One interesting difference is the use of a link function and the handling of ordinal target variables in Elkan and Menon's work.



[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006109#comment-13006109 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Sebastian,

I don't want to derail your commit, but your question about regularization suggested a thought to me.

One of the great advantages of the random projection methods over power-iteration methods is that iteration is so evil in Hadoop-based map-reduce, especially when you are simply reading the same input over and over.

With ALS-WR, you can run the program again for each value of the regularization parameter, but there is really nothing except possibly memory size preventing you from running all of these optimizations at the same time.

How hard would it be, do you think, to interleave the computations for multiple values of the regularization parameter into a single run of ALS-WR?



[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Bickson updated MAHOUT-542:
---------------------------------

    Attachment: unifiedpatch

Hi!
I tried to apply patch MAHOUT-542 against a clean svn checkout of the Mahout trunk.
It seems that the file AlternateLeastSquaresSolver.java had already been committed to svn, so the code got doubled.
I tried to merge the changes manually, and I attach the patch I created (named unifiedpatch).
The code compiles, but fails some tests.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.mahout.math.jet.random.NormalTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.437 sec
Running org.apache.mahout.math.TestMatrixView
Tests run: 49, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec
Running org.apache.mahout.math.TestDenseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.134 sec
Running org.apache.mahout.math.TestSparseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.315 sec
Running org.apache.mahout.math.jet.stat.ProbabilityTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec
Running org.apache.mahout.math.VectorListTest
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.291 sec
Running org.apache.mahout.math.stats.LogLikelihoodTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.11 sec
Running org.apache.mahout.math.jet.random.engine.MersenneTwisterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.049 sec
Running org.apache.mahout.math.decomposer.hebbian.TestHebbianSolver
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.071 sec
Running org.apache.mahout.math.TestDenseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.ExponentialTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.707 sec
Running org.apache.mahout.math.jet.stat.GammaTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.193 sec
Running org.apache.mahout.math.stats.OnlineSummarizerTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec
Running org.apache.mahout.math.VectorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.048 sec
Running org.apache.mahout.common.RandomUtilsTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.apache.mahout.math.TestSparseColumnMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.104 sec
Running org.apache.mahout.math.TestSequentialAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.05 sec
Running org.apache.mahout.math.QRDecompositionTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.045 sec
Running org.apache.mahout.math.TestSparseRowMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.09 sec
Running org.apache.mahout.math.TestRandomAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec
Running org.apache.mahout.math.TestOrderedIntDoubleMapping
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
Running org.apache.mahout.math.jet.random.NegativeBinomialTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.mahout.math.TestVectorView
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.GammaTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.72 sec
Running org.apache.mahout.math.TestSingularValueDecomposition
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.mahout.math.decomposer.lanczos.TestLanczosSolver
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.875 sec

Results :

Failed tests:

createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest)

ubuntu@ip-10-195-226-63:/usr/local/mahout-0.4$ cat math/target/surefire-reports/org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
-------------------------------------------------------------------------------
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest) Time elapsed: 0.002 sec  <<< FAILURE!
java.lang.AssertionError:
       at org.junit.Assert.fail(Assert.java:91)
       at org.junit.Assert.fail(Assert.java:98)
       at org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.createRiIiMaybeTransposedExceptionOnNonSequentialVector(AlternateLeastSquaresSolverTest.java:94)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
       at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
       at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
       at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
       at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)


Can you please take a look and advise me on what to do?

Thanks,

Danny Bickson


> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, unifiedpatch


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-542-3.patch

Thanks for the input so far Ted and Dimitriy.

Here is an updated patch that does not yet address the issue of automatically learning lambda, but provides some simple tools to evaluate the prediction quality of the factorization manually.

I ran some local tests against the Movielens 1M dataset on my notebook:

{noformat}
# download the MovieLens 1M dataset and convert it to Mahout's common rating format
cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv

# create a 90% training set and a 10% probe set
bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset --trainingPercentage 0.9 --probePercentage 0.1

# run distributed ALS-WR to factorize the rating matrix based on the training set 
bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# measure the error of the predictions against the probe set
bin/mahout evaluateALS --probes /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
{noformat}

A test run gave an RMSE of 0.8564062387241173 and an MAE of 0.6791075767551951.
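(For reference, both measures compare each held-out rating r_ui in the probe set P against the estimate u_u . m_i:

{noformat}
RMSE = sqrt( (1/|P|) * sum over (u,i) in P of (r_ui - u_u . m_i)^2 )
MAE  =       (1/|P|) * sum over (u,i) in P of |r_ui - u_u . m_i|
{noformat}

so the RMSE penalizes large individual errors more strongly than the MAE.)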

Unfortunately I don't currently have a cluster available to test this, so I couldn't use the Netflix dataset...

I still don't see how to automatically learn lambda without running lots of subsequent M/R jobs...

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979182#action_12979182 ] 

Sean Owen commented on MAHOUT-542:
----------------------------------

Sebastian, are you done with this? I know you added ALSWRFactorizer, for example. Is that what this issue covered?

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986859#action_12986859 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Hi Danny,

you're right, the latest patch is out of sync with the current trunk, as AlternateLeastSquaresSolver was already committed for MAHOUT-572. I'm a little busy at the moment, but I'll try to look at your patch this weekend; I hope that's OK for you.

--sebastian

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, unifiedpatch


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974436#action_12974436 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

{quote}
I still don't see how to automatically learn lambda without running lots of subsequent M/R jobs...
{quote}
Can you compute factorizations for multiple values of lambda in one go and then evaluate all of them in one pass?

This would require that parallelALS accept a list of lambdas and produce multiple outputs. It would also require that evaluateALS accept multiple models.

It should take about the same amount of time to test against lots of models as it does to test against a single model.  The distributed ALS might exhibit the same properties.
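To make that concrete, here is a rough sketch (plain Java, no Hadoop plumbing; all array names are hypothetical) of what evaluating several lambdas in one pass over the probe set could look like: the probe records are read once, and only the cheap per-model dot products are repeated.

{noformat}
/** One pass over the probe set evaluating several factorizations at once.
 *  u[l] and m[l] are the user/item feature matrices trained with the l-th
 *  lambda; probes[p] = {userId, itemId}; ratings[p] is the held-out rating. */
public final class MultiLambdaEval {

  static double[] rmsePerModel(int[][] probes, double[] ratings,
      double[][][] u, double[][][] m) {
    double[] sumSquaredError = new double[u.length];
    for (int p = 0; p < probes.length; p++) {
      int user = probes[p][0];
      int item = probes[p][1];
      for (int l = 0; l < u.length; l++) {
        // estimated preference under model l: dot product of feature vectors
        double estimate = 0.0;
        for (int f = 0; f < u[l][user].length; f++) {
          estimate += u[l][user][f] * m[l][item][f];
        }
        double error = ratings[p] - estimate;
        sumSquaredError[l] += error * error;
      }
    }
    double[] rmse = new double[u.length];
    for (int l = 0; l < u.length; l++) {
      rmse[l] = Math.sqrt(sumSquaredError[l] / probes.length);
    }
    return rmse;
  }
}
{noformat}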

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Bickson updated MAHOUT-542:
---------------------------------

    Attachment: logs.zip

Hi!
I used two m2.xlarge EC2 machines.
I attach the program output, along with all the configuration files from the two machines and all the Hadoop logs. It seems that I got the GC exception again. Let me know if you can identify a reason why I am running out of memory.

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979194#action_12979194 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

No, unfortunately there's still a lot of open work here. ALSWRFactorizer is just the non-distributed implementation of this algorithm.

In this issue I'm aiming for a distributed implementation. The actual matrix factorization part is working, but there are some open problems:

* The factorization needs a regularization parameter called lambda, which heavily influences the quality of the result. I don't see how to automatically find a near-optimal lambda (which would be a key requirement for providing a certain ease of use for this algorithm). I have some code in the works that can find a near-optimal lambda in a non-distributed way, but I'm not sure whether my approach is mathematically correct; I will put it up for review when I'm done.

* Once we have the factorization, we can easily estimate single preferences by computing the dot product of the user and item vectors from the factorization. However, if this job is to produce recommendations for all users, we cannot naively multiply the transpose of the user-features matrix with the item-features matrix to estimate all possible preferences, as these are dense matrices. We need to find a way to isolate a few candidate items per user, maybe by utilizing item cooccurrence; a sketch of the scoring part follows below. I'm not sure what the best approach is here either, as this problem is not covered in the paper.
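To illustrate the estimation part of the second point, here is a minimal sketch with plain arrays (the names and the candidate-set input are hypothetical, not the patch's API): a single preference estimate is just a dot product, and restricting the scoring to a small candidate set per user is what avoids materializing the dense product.

{noformat}
import java.util.Arrays;
import java.util.Comparator;

public final class CandidateScoring {

  /** Estimated preference: dot product of user and item feature vectors. */
  static double estimate(double[] user, double[] item) {
    double dot = 0.0;
    for (int f = 0; f < user.length; f++) {
      dot += user[f] * item[f];
    }
    return dot;
  }

  /** Scores only a precomputed candidate set (e.g. derived from item
   *  cooccurrence) instead of all items and returns the top n item ids. */
  static Integer[] topN(final double[] user, final double[][] items,
      Integer[] candidates, int n) {
    Integer[] sorted = candidates.clone();
    Arrays.sort(sorted, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        // highest estimated preference first
        return Double.compare(estimate(user, items[b]), estimate(user, items[a]));
      }
    });
    return Arrays.copyOf(sorted, Math.min(n, sorted.length));
  }
}
{noformat}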

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch


[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006746#comment-13006746 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Yes.

Keeping multiple versions in the computation is exactly what I had in mind. Whether this is useful depends on whether each version of U and M is smaller than the original. They certainly will be more efficient to read. As such, reading them multiple times might be a win.

Or not.  I haven't worked out the details and that is what really, really matters here.


> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931894#action_12931894 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Sebastian, 

I certainly did not want to slow your very commendable momentum. Getting a parallel version in place for comparison is very valuable.

I won't be able to look at your code due to my current constraints, but I think your general direction is very good.

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Clive Cox (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066668#comment-13066668 ] 

Clive Cox commented on MAHOUT-542:
----------------------------------

An earlier comment said:
"...if this job here should produce recommendations for all users, we cannot naively multiply the transpose of the user features matrix with the item features matrix to estimate all possible preferences as these are dense matrices."

Can someone explain further the problems with the naive approach?

How are people generally deriving recommendations from matrix factorization techniques? Get a neighbourhood from some other CF algorithm and then score only that neighbourhood using the derived matrices?

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 0.5
>
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Issue Comment Edited: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992108#comment-12992108 ] 

Danny Bickson edited comment on MAHOUT-542 at 2/8/11 7:10 PM:
--------------------------------------------------------------

Hi!
I used two m2.xlarge EC2 machines.
I attach the program output (see the attachment logs.zip above), including all the Hadoop configuration files from the two machines and all the Hadoop logs. It seems that I got the GC exception again. Let me know if you can identify a reason why I am running out of memory.

Thanks, 

Danny Bickson

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, logs.zip

[jira] [Updated] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Fix Version/s: 0.5
           Status: Patch Available  (was: In Progress)

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 0.5
>
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006350#comment-13006350 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Hi Ted,

I already spent some time thinking about that; unfortunately, no easy way to do it came to mind. I'll share my line of thought, as maybe there's some error in how I understand the algorithm:

The goal of ALS-WR is to factorize the rating matrix R into the user feature matrix U and the item feature matrix M. It tries to minimize the regularized squared error between R and U^T M via an iterative algorithm. We start with a randomly initialized M, use it to compute an optimized U, then fix U to compute an optimized M. We alternate between a fixed U and a fixed M until the error converges (or a maximum number of iterations is reached).

The question now is whether we can modify this process to use multiple lambda values (lambda being the regularization parameter used in the internal computations). You stated that we are "reading the same input over and over", which I don't see, as all versions of U and M (with the exception of the first randomly initialized M) depend on the lambda that was used to produce them. So as I understand the algorithm, n values for lambda would not only require n times the memory for the computations, but n versions of each U and M would also need to be moved around. I think I could find a way to make the code do that (maybe by using something like a multi-version vector that carries different results for different lambdas), but I'm not sure whether that's what you had in mind.
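To make the half-steps concrete: with M fixed, each user vector u_u has a closed-form solution, (M_Iu * M_Iu^T + lambda * n_u * E) u_u = M_Iu * R(u,Iu)^T, where Iu is the set of items the user has rated and n_u = |Iu|. Below is a minimal, self-contained sketch of that update with plain Java arrays (illustrative only; the dense rating matrix and all names are simplifications, not the code in the patch, and every user is assumed to have at least one rating):

{noformat}
public final class AlsWrSketch {

  /** Recompute all user feature vectors for a fixed item feature matrix m.
   *  r[user][item] > 0 marks a rating; k is the number of features. */
  static double[][] fixMSolveU(double[][] r, double[][] m, double lambda, int k) {
    double[][] u = new double[r.length][];
    for (int user = 0; user < r.length; user++) {
      int n = 0;                       // n_u: number of items this user rated
      for (double rating : r[user]) {
        if (rating > 0) n++;
      }
      double[][] a = new double[k][k]; // accumulates M_Iu * M_Iu^T
      double[] v = new double[k];      // accumulates M_Iu * R(u,Iu)^T
      for (int item = 0; item < r[user].length; item++) {
        if (r[user][item] <= 0) continue;
        for (int i = 0; i < k; i++) {
          v[i] += r[user][item] * m[item][i];
          for (int j = 0; j < k; j++) {
            a[i][j] += m[item][i] * m[item][j];
          }
        }
      }
      for (int i = 0; i < k; i++) {
        a[i][i] += lambda * n;         // weighted-lambda regularization
      }
      u[user] = solve(a, v);
    }
    return u;
  }

  /** Gaussian elimination with partial pivoting for the small k x k system. */
  static double[] solve(double[][] a, double[] b) {
    int n = b.length;
    for (int p = 0; p < n; p++) {
      int max = p;
      for (int i = p + 1; i < n; i++) {
        if (Math.abs(a[i][p]) > Math.abs(a[max][p])) max = i;
      }
      double[] tmpRow = a[p]; a[p] = a[max]; a[max] = tmpRow;
      double tmpVal = b[p]; b[p] = b[max]; b[max] = tmpVal;
      for (int i = p + 1; i < n; i++) {
        double factor = a[i][p] / a[p][p];
        b[i] -= factor * b[p];
        for (int j = p; j < n; j++) {
          a[i][j] -= factor * a[p][j];
        }
      }
    }
    double[] x = new double[n];
    for (int i = n - 1; i >= 0; i--) {
      double sum = 0.0;
      for (int j = i + 1; j < n; j++) {
        sum += a[i][j] * x[j];
      }
      x[i] = (b[i] - sum) / a[i][i];
    }
    return x;
  }
}
{noformat}

The item half-step is symmetric with the roles of U and M swapped, which is also why every lambda re-derives its own sequence of intermediate U and M matrices.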

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Assigned: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter reassigned MAHOUT-542:
-----------------------------------------

    Assignee: Sebastian Schelter

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991722#comment-12991722 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Can you share some more details about the machines you ran this on?

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch

[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "pragati meena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070018#comment-13070018 ] 

pragati meena commented on MAHOUT-542:
--------------------------------------

Hi Sebastian,

I am trying to run the example on Windows using Hadoop on Cygwin, but I keep getting the following error, even though the history file exists at the same directory location:

Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: /user/hadoop/temp/errors/_logs
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:73)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:67)
at com.google.common.collect.Iterators$8.next(Iterators.java:730)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:508)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:40)
at org.apache.mahout.utils.eval.ParallelFactorizationEvaluator.computeRmse(ParallelFactorizationEvaluator.java:111)

Any ideas on how to fix this?

Regards,

Pragati Meena

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 0.5
>
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991835#comment-12991835 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

I also think it's a memory problem. When recomputing user features, the algorithm needs to look at the feature vectors of all movies the user has rated, and when recomputing movie features it has to look at the feature vectors of all users who have rated that movie. I'm not very familiar with the Netflix dataset, but there might be some very popular movies that have been rated by lots of users, and there might also be some "power users" who have rated lots of movies. So the memory consumption might get very high in some steps.

What kind of EC2 instances did you run this on? Did you use small instances? From my experience those are not very helpful; maybe you could retry this with large or c1.medium instances.

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994872#comment-12994872 ] 

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Problem solved:
1) I increased the heap size to 4 GB.
2) Moved to a larger instance: m2.2xlarge.
3) Increased the child mappers' memory to 2 GB.

One or more of those changes fixed the memory error.

One iteration over the full Netflix data takes around 75 minutes.
The RMSE looks good: 1.04 after one iteration.

- Danny
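For anyone hitting the same wall: assuming a stock Hadoop 0.20 setup, settings like these typically live in conf/hadoop-env.sh and conf/mapred-site.xml (the values below are just the ones reported above; adapt them to your instances):

{noformat}
# conf/hadoop-env.sh: heap for the Hadoop daemons, in MB
export HADOOP_HEAPSIZE=4000

<!-- conf/mapred-site.xml: heap for the child map/reduce JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
{noformat}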

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, logs.zip

[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Fabian Alenius (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107453#comment-13107453 ] 

Fabian Alenius commented on MAHOUT-542:
---------------------------------------

Hi, I was thinking of rewriting the itemRatings and userRatings jobs into a single job using MultipleOutputs. Based on my understanding, release 0.20.2* supports MultipleOutputs, although through deprecated APIs. Would such a patch be accepted, or are there issues prohibiting such a change?

What is the current target version of Hadoop?
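As a rough sketch of what that could look like with the old (mapred) API that 0.20.2 ships (the class name, the named outputs, and the key-tagging convention are all hypothetical, not Mahout's actual code): register two named outputs at job setup via MultipleOutputs.addNamedOutput(conf, "userRatings", SequenceFileOutputFormat.class, IntWritable.class, VectorWritable.class), likewise for "itemRatings", and then route in the reducer:

{noformat}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.mahout.math.VectorWritable;

/** Hypothetical reducer for a combined job. Assume the mapper emitted every
 *  rating twice: once keyed by userID and once keyed by itemID + ITEM_OFFSET,
 *  so a single reduce pass can feed both named outputs. */
public class CombinedRatingsReducer extends MapReduceBase
    implements Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  static final int ITEM_OFFSET = 1 << 30; // made-up tagging convention

  private MultipleOutputs outputs;

  @Override
  public void configure(JobConf conf) {
    outputs = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  @Override
  public void reduce(IntWritable id, Iterator<VectorWritable> vectors,
      OutputCollector<IntWritable, VectorWritable> unused, Reporter reporter)
      throws IOException {
    String target = id.get() >= ITEM_OFFSET ? "itemRatings" : "userRatings";
    IntWritable realId =
        new IntWritable(id.get() >= ITEM_OFFSET ? id.get() - ITEM_OFFSET : id.get());
    while (vectors.hasNext()) {
      // the real job would merge the partial vectors before writing
      outputs.getCollector(target, reporter).collect(realId, vectors.next());
    }
  }

  @Override
  public void close() throws IOException {
    outputs.close();
  }
}
{noformat}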

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 0.5
>
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, MAHOUT-542-6.patch, logs.zip

[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991811#comment-12991811 ] 

Lance Norskog commented on MAHOUT-542:
--------------------------------------

bq. GC overhead limit exceeded
This means that the JVM is spending more than 98% of its time doing garbage collection; it is usually a sign of "not quite enough memory". There are other JVM parameters around garbage collection that you can tune: change the collector, change specific parameters. 'jvisualvm' (bundled with recent Java 1.6 JDKs) and 'visualgc' are free tools; you can attach to the running Map/Reduce JVM and watch its garbage collection and memory allocation levels.
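As a concrete illustration (generic JVM/Hadoop knobs, not settings taken from this thread): adding GC logging to the child JVM options shows where the time goes, e.g.

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m -verbose:gc -XX:+PrintGCDetails</value>
</property>
{noformat}

and jvisualvm can then attach to a running task JVM by process id to watch heap and GC activity live.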

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch

[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066825#comment-13066825 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

The problem with the naive approach is that the resulting matrix is going to be huge (millions of users times hundreds of thousands of items) and dense, which makes it infeasible to compute and store.

I'm not aware of a general approach for computing recommendations from matrix decompositions; in the scientific literature they are, as far as I know, only used for measuring the prediction error on held-out data.
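To put a number on it: even at Netflix scale (roughly 480,000 users by 17,770 movies), the dense product U^T M would have about 8.5 billion entries, i.e. on the order of 68 GB of doubles, and production datasets are larger still.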


[jira] [Updated] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)


[jira] Issue Comment Edited: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994875#comment-12994875 ] 

Danny Bickson edited comment on MAHOUT-542 at 2/15/11 4:48 PM:
---------------------------------------------------------------

Yes, I am doing some more experiments, because we are comparing our system, GraphLab (http://www.graphlab.ml.cmu.edu/), to Mahout.

I am not sure if this is the right forum to update about performance, but I will be glad to
update anyone interested - email me...


[jira] Issue Comment Edited: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Danny Bickson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986825#action_12986825 ] 

Danny Bickson edited comment on MAHOUT-542 at 1/25/11 10:05 PM:
----------------------------------------------------------------

Hi!
I tried to apply patch MAHOUT-542-3 against a clean svn checkout of Mahout trunk.
It seems that the file AlternateLeastSquaresSolver.java had already been committed to svn, so the code got duplicated.
I tried to merge the changes manually and got several compilation errors. I attach a patch I created (named unifiedpatch)
after fixing those compilation errors.

The code compiles, but fails some tests.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.mahout.math.jet.random.NormalTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.437 sec
Running org.apache.mahout.math.TestMatrixView
Tests run: 49, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec
Running org.apache.mahout.math.TestDenseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.134 sec
Running org.apache.mahout.math.TestSparseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.315 sec
Running org.apache.mahout.math.jet.stat.ProbabilityTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec
Running org.apache.mahout.math.VectorListTest
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.291 sec
Running org.apache.mahout.math.stats.LogLikelihoodTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.11 sec
Running org.apache.mahout.math.jet.random.engine.MersenneTwisterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.049 sec
Running org.apache.mahout.math.decomposer.hebbian.TestHebbianSolver
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.071 sec
Running org.apache.mahout.math.TestDenseMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.ExponentialTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.707 sec
Running org.apache.mahout.math.jet.stat.GammaTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.193 sec
Running org.apache.mahout.math.stats.OnlineSummarizerTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec
Running org.apache.mahout.math.VectorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.048 sec
Running org.apache.mahout.common.RandomUtilsTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.apache.mahout.math.TestSparseColumnMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.104 sec
Running org.apache.mahout.math.TestSequentialAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.05 sec
Running org.apache.mahout.math.QRDecompositionTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.045 sec
Running org.apache.mahout.math.TestSparseRowMatrix
Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.09 sec
Running org.apache.mahout.math.TestRandomAccessSparseVector
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec
Running org.apache.mahout.math.TestOrderedIntDoubleMapping
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
Running org.apache.mahout.math.jet.random.NegativeBinomialTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.mahout.math.TestVectorView
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
Running org.apache.mahout.math.jet.random.GammaTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.72 sec
Running org.apache.mahout.math.TestSingularValueDecomposition
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.mahout.math.decomposer.lanczos.TestLanczosSolver
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.875 sec

Results :

Failed tests:

createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest)

ubuntu@ip-10-195-226-63:/usr/local/mahout-0.4$ cat math/target/surefire-reports/org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
-------------------------------------------------------------------------------
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest) Time elapsed: 0.002 sec  <<< FAILURE!
java.lang.AssertionError:
       at org.junit.Assert.fail(Assert.java:91)
       at org.junit.Assert.fail(Assert.java:98)
       at org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.createRiIiMaybeTransposedExceptionOnNonSequentialVector(AlternateLeastSquaresSolverTest.java:94)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
       at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
       at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
       at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
       at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:616)
       at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
       at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)


Can you please take a look and instruct me what to do?

Thanks,

Danny Bickson



[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107454#comment-13107454 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

The current Hadoop version is 0.20.204.0.

I don't think ALS-WR is really used at the moment, and I don't think it fits the MapReduce paradigm very well, so you'd have my go-ahead to play with the code. We still use the deprecated APIs for a lot of our distributed linear algebra code, so I'd be okay with that. But I'd like to hear other opinions first.


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-542-6.patch


[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010212#comment-13010212 ] 

Ted Dunning commented on MAHOUT-542:
------------------------------------

Progress always sounds good to me.  It is even better when there is a plan for further progress.

I say ship it!


[jira] Updated: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-452.patch



[jira] Commented: (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006033#comment-13006033 ] 

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Attached a new version of the patch. I'd like to commit this one in the next few days, if there are no objections (and no errors found). This patch removes some parts of the code that were highly memory-intensive and hopefully enables tests with a higher number of features. It introduces a set of tools that might enable a first real-world usage of this algorithm:

* DatasetSplitter: split a rating dataset into training and probe parts
* ParallelALSFactorizationJob: parallel ALS-WR factorization of a rating matrix
* PredictionJob: predict preferences using the factorization of a rating matrix
* InMemoryFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes in memory
* ParallelFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes

There are still open points, in particular how to find a good regularization parameter automatically and efficiently (a brute-force sweep is sketched after the walkthrough below) and how to create an automated recommender pipeline similar to that of RecommenderJob using these tools. But I think these issues can be tackled in the future.

Here's how to play with the code:

{noformat}
# convert the movielens 1M dataset to mahout's common format for ratings
cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv

# create a 90% training set and a 10% probe set
bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset --trainingPercentage 0.9 --probePercentage 0.1

# run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# compute predictions against the probe set, measure the error
bin/mahout evaluateFactorizationParallel --output /tmp/als/rmse --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/

# print the error
cat /tmp/als/rmse/rmse.txt 
0.8531723318490103

# alternatively you can use the factorization to predict unknown ratings
bin/mahout predictFromFactorization --output /tmp/als/predict --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ --tempDir /tmp/als/predictTmp

# look at the predictions
cat /tmp/als/predict/part-r-*
1,150,4.0842405867880975
1,1029,4.163510579205656
1,745,3.7759166479388777
1,2294,3.495085673991081
1,938,3.6820865362790594
2,2067,3.8303249557251644
2,1090,3.954322089979675
2,1196,3.912089186677311
2,498,2.820740198815573
2,593,4.090550572202017
...
{noformat}
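
Regarding the open point about the regularization parameter: until something smarter exists, a brute-force sweep is easy to script on top of these tools. A minimal sketch, reusing the paths from the walkthrough above (the candidate lambdas are arbitrary examples, not recommendations):

{noformat}
# factorize and evaluate once per candidate lambda, then compare the errors
for lambda in 0.04 0.065 0.1 0.15; do
  bin/mahout parallelALS --input /tmp/dataset/trainingSet/ \
    --output /tmp/als/out-$lambda --tempDir /tmp/als/tmp-$lambda \
    --numFeatures 20 --numIterations 10 --lambda $lambda
  bin/mahout evaluateFactorizationParallel --output /tmp/als/rmse-$lambda \
    --pairs /tmp/dataset/probeSet/ \
    --userFeatures /tmp/als/out-$lambda/U/ --itemFeatures /tmp/als/out-$lambda/M/
  echo "lambda=$lambda rmse=$(cat /tmp/als/rmse-$lambda/rmse.txt)"
done
{noformat}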

[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150419#comment-13150419 ] 

Raphael Cendrillon commented on MAHOUT-542:
-------------------------------------------

I'd like to get more involved in contributing to Mahout. If there's any area where you need support, regarding ALS-WR or other topics as well, I'd be very happy to lend a hand.

I was particularly interested in your comments on automatically finding a good setting for lambda. I'm wondering whether something more sophisticated than exhaustive search could be done; for example, if the loss function evaluated on the hold-out dataset is a convex function of lambda, then gradient descent (or quasi-Newton methods) could be used.
                

[jira] [Commented] (MAHOUT-542) MapReduce implementation of ALS-WR

Posted by "Alvin AuYoung (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118267#comment-13118267 ] 

Alvin AuYoung commented on MAHOUT-542:
--------------------------------------

Hi Sebastian,

First of all, many thanks for contributing this ALS implementation. It's very useful. Like others on this list, I'm trying to run some experiments with it using the Netflix data, but I'm seeing an error I'm having trouble diagnosing. After the first 4 jobs complete, the reduce copiers fail in the 5th job (Mapper-SolvingReducer). I'm running hadoop-0.20.2 and checked out Mahout from trunk, so I believe any patches you've mentioned should be incorporated.

Here is a description of the job I'm running: MAHOUT-JOB: /home/auyoung/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/09/30 01:20:56 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=training_all_triplets_norm, --lambda=0.065, --numFeatures=25, --numIterations=5, --output=als.out, --startPhase=0, --tempDir=temp}

Do you have any ideas about what might be wrong? I'm running it on a physical cluster of 20 slaves, each with 2 mappers and 2 reducers; there is > 8 GB of memory per JVM, > 2 GB HADOOP_HEAPSIZE, and the maximum allowable io.sort.mb of 2047. There is also plenty of disk space remaining. Here is a transcript of one of the several failures of the ParallelALSFactorizationJob Mapper-SolvingReducer:

2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 16 segments left of total size: 1039493457 bytes
2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermediate merge failed
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2576)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
Caused by: java.lang.RuntimeException: java.io.EOFException
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
	at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
	at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:136)
	at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
	at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
	at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
	at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2560)
	... 1 more
Caused by: java.io.EOFException
	at java.io.DataInputStream.readByte(DataInputStream.java:250)
	at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
	at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
	at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
	... 8 more

2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merging of the local FS files threw an exception: java.io.IOException: java.lang.RuntimeException: java.io.EOFException
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
	at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
	at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:139)
	at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
	at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
	at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
	at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2454)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readByte(DataInputStream.java:250)
	at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
	at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
	at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:100)
	... 7 more

	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2458)

Thanks,

Alvin

                