Posted to user@mahout.apache.org by Sanjib Kumar Das <sa...@gmail.com> on 2010/11/19 22:34:03 UTC

Need for a distributed SVDRecommender

Hi All,

I wanted to run a distributed RecommenderJob with the SVDRecommender
implementation, so I ran the pseudo.RecommenderJob with an
SVDRecommender (numFeatures=30, trainingSteps=50) on the 1M MovieLens
data (6,040 users). It generated 10 recommendations for each of the 6,040
users, but took 14 hours to do so! My Hadoop cluster had 12 machines, so I
guess it just ran multiple instances of the non-distributed SVD
implementation, and each of these instances repeated the same work. So
unless the recommender implementation itself is distributed, we don't get
any special benefit from the pseudo.RecommenderJob.
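
What SVDRecommender(numFeatures, trainingSteps) trains is essentially an
iterative gradient-descent factorization of the rating matrix, and nothing in
that training loop is parallelized. A minimal self-contained sketch of the
idea (illustrative only -- this is not Mahout's actual code, and the class and
method names are invented):

```java
// Sketch of SGD-based rating-matrix factorization, in the spirit of what
// SVDRecommender(numFeatures, trainingSteps) trains. Illustrative, not Mahout code.
import java.util.Random;

public class SgdFactorization {

    // ratings[k] = {userIndex, itemIndex, rating}; only observed cells are trained.
    public static double[][][] factorize(double[][] ratings, int numUsers, int numItems,
                                         int numFeatures, int steps, double lr, double reg) {
        Random rnd = new Random(42);
        double[][] user = new double[numUsers][numFeatures];
        double[][] item = new double[numItems][numFeatures];
        // Small random initialization to break symmetry.
        for (double[] row : user) for (int f = 0; f < numFeatures; f++) row[f] = 0.1 * rnd.nextGaussian();
        for (double[] row : item) for (int f = 0; f < numFeatures; f++) row[f] = 0.1 * rnd.nextGaussian();

        for (int step = 0; step < steps; step++) {         // the "trainingSteps" loop
            for (double[] r : ratings) {
                int u = (int) r[0], i = (int) r[1];
                double err = r[2] - dot(user[u], item[i]);
                for (int f = 0; f < numFeatures; f++) {    // regularized SGD update
                    double uf = user[u][f], vf = item[i][f];
                    user[u][f] += lr * (err * vf - reg * uf);
                    item[i][f] += lr * (err * uf - reg * vf);
                }
            }
        }
        return new double[][][] {user, item};
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int f = 0; f < a.length; f++) s += a[f] * b[f];
        return s;
    }

    public static void main(String[] args) {
        // Two users who agree on item 0; predict user 1's unobserved rating for item 1.
        double[][] ratings = {{0, 0, 5}, {0, 1, 1}, {1, 0, 5}};
        double[][][] model = factorize(ratings, 2, 2, 2, 200, 0.05, 0.01);
        System.out.println("predicted: " + dot(model[0][1], model[1][1]));
    }
}
```

Each pseudo mapper would run this whole loop over the full data set, which is
why adding machines does not shorten the 14 hours.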

The item.RecommenderJob, by contrast, produces the same 10 recommendations
for each of the 6,040 users in 38 minutes, because it has an underlying
distributed implementation.
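
Roughly speaking, item.RecommenderJob boils down to a co-occurrence-flavored
item-item computation that splits naturally across mappers and reducers. A toy
single-machine sketch of that scoring scheme (illustrative only -- the real
job is far more elaborate):

```java
// Toy item-based scoring of the kind item.RecommenderJob distributes:
// build an item-item co-occurrence matrix, then score an unseen item by
// summing its co-occurrence counts with the user's items. Illustrative only.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CooccurrenceScoring {

    // prefs: userId -> set of itemIds the user expressed a preference for.
    public static int[][] cooccurrence(Map<Integer, Set<Integer>> prefs, int numItems) {
        int[][] c = new int[numItems][numItems];
        for (Set<Integer> items : prefs.values())
            for (int a : items)
                for (int b : items)
                    if (a != b) c[a][b]++;   // count items preferred together
        return c;
    }

    // Score = co-occurrence row of the candidate dotted with the user's items.
    public static double score(int[][] c, Set<Integer> userItems, int candidate) {
        double s = 0;
        for (int seen : userItems) s += c[candidate][seen];
        return s;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> prefs = new HashMap<>();
        prefs.put(0, new HashSet<>(Arrays.asList(0, 1)));
        prefs.put(1, new HashSet<>(Arrays.asList(0, 1, 2)));
        prefs.put(2, new HashSet<>(Arrays.asList(1, 2)));
        int[][] c = cooccurrence(prefs, 3);
        // User 0 has not seen item 2; its score comes from items 0 and 1.
        System.out.println("score for item 2: " + score(c, prefs.get(0), 2)); // 3.0
    }
}
```

Both the co-occurrence counting and the final matrix-vector multiplication are
sums of independent per-user contributions, which is what makes the real job
map-reduce friendly.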

So my question is: do we have a distributed SVDRecommender implementation?
If not, how should I go about writing one? Can I use the new LanczosSolver
to achieve this?

Thanks,
Sanjib

Re: Need for a distributed SVDRecommender

Posted by Sean Owen <sr...@gmail.com>.
I see, yes, the latter is actually distributed. They are very different
algorithms anyway.

On Fri, Nov 19, 2010 at 11:24 PM, Sanjib Kumar Das <sa...@gmail.com> wrote:

>  it takes 14 hrs to run the *pseudo*.RecommenderJob with the
> SVDRecommender.
> Ran the following command:
> hadoop jar recommender.jar
> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob
> -Dmapred.input.dir=testdata/ratings.csv -Dmapred.output.dir=outputBR
> --recommenderClassName
> org.apache.mahout.cf.taste.example.bucky.BuckyRecommender
>
> Here BuckyRecommender is SVDRecommender(30,50)
>
>
> it takes 38 minutes if I run the *item*.RecomenderJob with the following
> command :
> hadoop jar recommender.jar
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> -Dmapred.input.dir=testdata/ratings.csv -Dmapred.output.dir=output
>
> item.RecommenderJob is very different from pseudo.RecommenderJob (in terms
> of the distributed implementation) hence the difference in timings, i
> guess.
>
>
> On Fri, Nov 19, 2010 at 4:04 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > That result sounds confusing. It should take about the same number of
> > wall-clock hours either way. I don't see why it would take 14 hours --
> that
> > sounds wrong. If anything it should take 38 / N minutes where N is the
> > number of recommenders
> > you ran.
> >
> > SVDRecommender is not distributed at all, no.
> >
> > On Fri, Nov 19, 2010 at 9:34 PM, Sanjib Kumar Das <sanjib.kgp@gmail.com
> > >wrote:
> >
> > > Hi All,
> > >
> > > I wanted to run a distributed RecommenderJob with the SVDRecommender
> > > implementation.
> > > So i ran the pseudo.RecommenderJob with an
> > > SVDRecommender(numFeatures=30,trainingSteps=50) on the 1M Movielens
> > > data(6040 users). So this generated 10 recommendations for each of the
> > 6040
> > > users but took 14 hours to do so! My hadoop cluster had 12 m/cs. So i
> > guess
> > > it just ran multiple instances of the non-distributed SVD
> implementation
> > > and
> > > each of these instances did the same thing again and again. So unless
> the
> > > implementation of the recommender is distributed, we dont get any
> special
> > > benefit with the pseudo.RecommenderJob.
> > >
> > > But the item.RecommenderJob does the same 10 recommendations each for
> the
> > > 6040 users in 38 minutes. This is because it has an underlying
> > distributed
> > > implementation.
> > >
> > > So my doubt is do we have a distributed SVDRecommender implementation?
> If
> > > not, how should i go about writing one? Can I use the new LanczosSolver
> > to
> > > achieve this?
> > >
> > > Thanks,
> > > Sanjib
> > >
> >
>

Re: Need for a distributed SVDRecommender

Posted by Sanjib Kumar Das <sa...@gmail.com>.
It takes 14 hrs to run the *pseudo*.RecommenderJob with the SVDRecommender.
I ran the following command:

hadoop jar recommender.jar \
  org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob \
  -Dmapred.input.dir=testdata/ratings.csv -Dmapred.output.dir=outputBR \
  --recommenderClassName org.apache.mahout.cf.taste.example.bucky.BuckyRecommender

Here BuckyRecommender is SVDRecommender(30, 50).

It takes 38 minutes if I run the *item*.RecommenderJob with the following
command:

hadoop jar recommender.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=testdata/ratings.csv -Dmapred.output.dir=output

item.RecommenderJob is very different from pseudo.RecommenderJob (in terms
of the distributed implementation), hence the difference in timings, I guess.


On Fri, Nov 19, 2010 at 4:04 PM, Sean Owen <sr...@gmail.com> wrote:

> That result sounds confusing. It should take about the same number of
> wall-clock hours either way. I don't see why it would take 14 hours -- that
> sounds wrong. If anything it should take 38 / N minutes where N is the
> number of recommenders
> you ran.
>
> SVDRecommender is not distributed at all, no.
>
> On Fri, Nov 19, 2010 at 9:34 PM, Sanjib Kumar Das <sanjib.kgp@gmail.com
> >wrote:
>
> > Hi All,
> >
> > I wanted to run a distributed RecommenderJob with the SVDRecommender
> > implementation.
> > So i ran the pseudo.RecommenderJob with an
> > SVDRecommender(numFeatures=30,trainingSteps=50) on the 1M Movielens
> > data(6040 users). So this generated 10 recommendations for each of the
> 6040
> > users but took 14 hours to do so! My hadoop cluster had 12 m/cs. So i
> guess
> > it just ran multiple instances of the non-distributed SVD implementation
> > and
> > each of these instances did the same thing again and again. So unless the
> > implementation of the recommender is distributed, we dont get any special
> > benefit with the pseudo.RecommenderJob.
> >
> > But the item.RecommenderJob does the same 10 recommendations each for the
> > 6040 users in 38 minutes. This is because it has an underlying
> distributed
> > implementation.
> >
> > So my doubt is do we have a distributed SVDRecommender implementation? If
> > not, how should i go about writing one? Can I use the new LanczosSolver
> to
> > achieve this?
> >
> > Thanks,
> > Sanjib
> >
>

Re: Need for a distributed SVDRecommender

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Sanjib,

MAHOUT-542 uses a different algorithmic approach to factorize the matrix,
as described in "Large-scale Parallel Collaborative Filtering for the
Netflix Prize"
(http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf);
it is not related to MAHOUT-371.

On 24.11.2010 07:28, Sanjib Kumar Das wrote:
>  From what I understand Mahout-371 tries to address the
> DistributedSVDRecommenderJob. Is it fully ready for use?
>
> @Sebastian : The above recommender uses the DistributedLanczosSolver to
> achieve the SVD. So, should the distributed Matrix Factorization(Mahout-542)
> you were talking about be integrated with it instead?
>
> I am slightly confused....
> On Fri, Nov 19, 2010 at 4:32 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> On Fri, Nov 19, 2010 at 2:27 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>>>> Can I use the new LanczosSolver to achieve this?
>>>
>>> The paper "Large-scale Parallel Collaborative Filtering for the Netflix
>>> Prize" says that you can't use Lanczos to factorize a rating matrix as
>>> it is only partially specified. However someone with more mathematical
>>> expertise than me should validate that statement, hope I didn't get that
>>> wrong :)
>>
>> You correctly quoted the statement.  But I don't think that the statement
>> is entirely correct.  The difference in practice isn't all that big a deal.
>>
>>> Ted is working on LatentFactorLogLinear models in MAHOUT-525 which can
>>> be used for recommendations too and should be superior to the approach
>>> of MAHOUT-542. They're not distributed but in the paper in which they
>>> are described the authors state that they could train the 1M Movielens
>>> Dataset in 7 minutes so they should be fast enough for your testcase.
>>
>> This is where I would push for recommendations.  I have a preliminary
>> implementation available on github, but I don't think it is ready to
>> commit.  It does do roughly what it is supposed to do (on one test) but I
>> don't have enough runtime with it to have any level of confidence yet.


Re: Need for a distributed SVDRecommender

Posted by Sanjib Kumar Das <sa...@gmail.com>.
From what I understand, MAHOUT-371 tries to address the
DistributedSVDRecommenderJob. Is it fully ready for use?

@Sebastian: The above recommender uses the DistributedLanczosSolver to
compute the SVD. So, should the distributed matrix factorization
(MAHOUT-542) you were talking about be integrated with it instead?

I am slightly confused....

On Fri, Nov 19, 2010 at 4:32 PM, Ted Dunning <te...@gmail.com> wrote:

> On Fri, Nov 19, 2010 at 2:27 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > >> Can I use the new LanczosSolver to achieve this?
> >
> > The paper "Large-scale Parallel Collaborative Filtering for the Netflix
> > Prize" says that you can't use Lanczos to factorize a rating matrix as
> > it is only partially specified. However someone with more mathematical
> > expertise than me should validate that statement, hope I didn't get that
> > wrong :)
> >
>
> You correctly quoted the statement.  But I don't think that the statement
> is entirely correct.  The difference in practice isn't all that big a deal.
>
>
> > Ted is working on LatentFactorLogLinear models in MAHOUT-525 which can
> > be used for recommendations too and should be superior to the approach
> > of MAHOUT-542. They're not distributed but in the paper in which they
> > are described the authors state that they could train the 1M Movielens
> > Dataset in 7 minutes so they should be fast enough for your testcase.
> >
>
> This is where I would push for recommendations.  I have a preliminary
> implementation available on github, but I don't think it is ready to
> commit.  It does do roughly what it is supposed to do (on one test) but I
> don't have enough runtime with it to have any level of confidence yet.
>

Re: Need for a distributed SVDRecommender

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Nov 19, 2010 at 2:27 PM, Sebastian Schelter <ss...@apache.org> wrote:

> >> Can I use the new LanczosSolver to achieve this?
>
> The paper "Large-scale Parallel Collaborative Filtering for the Netflix
> Prize" says that you can't use Lanczos to factorize a rating matrix as
> it is only partially specified. However someone with more mathematical
> expertise than me should validate that statement, hope I didn't get that
> wrong :)
>

You correctly quoted the statement.  But I don't think that the statement is
entirely correct.  The difference in practice isn't all that big a deal.


> Ted is working on LatentFactorLogLinear models in MAHOUT-525 which can
> be used for recommendations too and should be superior to the approach
> of MAHOUT-542. They're not distributed but in the paper in which they
> are described the authors state that they could train the 1M Movielens
> Dataset in 7 minutes so they should be fast enough for your testcase.
>

This is where I would push for recommendations.  I have a preliminary
implementation available on github, but I don't think it is ready to
commit.  It does do roughly what it is supposed to do (on one test) but I
don't have enough runtime with it to have any level of confidence yet.

Re: Need for a distributed SVDRecommender

Posted by Sebastian Schelter <ss...@apache.org>.
>> So my doubt is do we have a distributed SVDRecommender implementation? If
>> not, how should i go about writing one? 

The algorithm in MAHOUT-542 performs a distributed matrix factorization
that is intended to be used for recommendations one day. However, it is at
a very early stage, and it has not even been verified yet that the
implementation works correctly. If you want to help with it, that would be
great.

>> Can I use the new LanczosSolver to achieve this?

The paper "Large-scale Parallel Collaborative Filtering for the Netflix
Prize" says that you can't use Lanczos to factorize a rating matrix as
it is only partially specified. However someone with more mathematical
expertise than me should validate that statement, hope I didn't get that
wrong :)
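
One way to see the "partially specified" issue concretely: the ALS
formulation in that paper builds, for each user, a small normal-equation
system from that user's observed ratings only, so missing cells never enter
the computation at all -- something a full SVD of the matrix cannot do
without imputing them. A toy sketch of one ALS half-step with two latent
features (illustrative only, not the MAHOUT-542 code):

```java
// Toy ALS half-step over a partially specified rating matrix (2 latent
// features). Illustrative only -- not MAHOUT-542 code. For each user, a
// small system A x = b is accumulated from that user's *observed* ratings
// only, then solved in closed form.
public class AlsHalfStep {

    // NaN marks an unobserved cell; itemFeatures is held fixed for this half-step.
    public static double[][] solveUsers(double[][] ratings, double[][] itemFeatures, double lambda) {
        int numUsers = ratings.length;
        double[][] users = new double[numUsers][2];
        for (int u = 0; u < numUsers; u++) {
            // A = sum(v v^T) + lambda*I and b = sum(r * v), over observed items only.
            double a00 = lambda, a01 = 0, a11 = lambda, b0 = 0, b1 = 0;
            for (int i = 0; i < ratings[u].length; i++) {
                if (Double.isNaN(ratings[u][i])) continue; // skip unobserved cells
                double v0 = itemFeatures[i][0], v1 = itemFeatures[i][1];
                a00 += v0 * v0; a01 += v0 * v1; a11 += v1 * v1;
                b0 += ratings[u][i] * v0; b1 += ratings[u][i] * v1;
            }
            // Solve the 2x2 system A x = b by Cramer's rule.
            double det = a00 * a11 - a01 * a01;
            users[u][0] = (b0 * a11 - b1 * a01) / det;
            users[u][1] = (a00 * b1 - a01 * b0) / det;
        }
        return users;
    }

    public static void main(String[] args) {
        double[][] ratings = {{5, Double.NaN}, {4, 2}};
        double[][] items = {{1, 0}, {0, 1}};   // fixed item factors for the half-step
        double[][] users = solveUsers(ratings, items, 0.05);
        // User 0 sees only item 0, so its factors come from that single
        // observation (approximately 4.76, 0.00).
        System.out.println("user 0 factors: " + users[0][0] + ", " + users[0][1]);
    }
}
```

Because each user's system is independent, the user (and, symmetrically, the
item) half-steps shard naturally across a cluster, which is what makes the
approach attractive for a distributed implementation.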

Ted is working on LatentFactorLogLinear models in MAHOUT-525, which can
be used for recommendations too and should be superior to the approach of
MAHOUT-542. They're not distributed, but in the paper in which they are
described the authors state that they could train on the 1M MovieLens
dataset in 7 minutes, so they should be fast enough for your test case.

--sebastian


Re: Need for a distributed SVDRecommender

Posted by Sean Owen <sr...@gmail.com>.
That result sounds confusing. It should take about the same number of
wall-clock hours either way. I don't see why it would take 14 hours -- that
sounds wrong. If anything it should take 38 / N minutes, where N is the
number of recommenders you ran.

SVDRecommender is not distributed at all, no.

On Fri, Nov 19, 2010 at 9:34 PM, Sanjib Kumar Das <sa...@gmail.com> wrote:

> Hi All,
>
> I wanted to run a distributed RecommenderJob with the SVDRecommender
> implementation.
> So i ran the pseudo.RecommenderJob with an
> SVDRecommender(numFeatures=30,trainingSteps=50) on the 1M Movielens
> data(6040 users). So this generated 10 recommendations for each of the 6040
> users but took 14 hours to do so! My hadoop cluster had 12 m/cs. So i guess
> it just ran multiple instances of the non-distributed SVD implementation
> and
> each of these instances did the same thing again and again. So unless the
> implementation of the recommender is distributed, we dont get any special
> benefit with the pseudo.RecommenderJob.
>
> But the item.RecommenderJob does the same 10 recommendations each for the
> 6040 users in 38 minutes. This is because it has an underlying distributed
> implementation.
>
> So my doubt is do we have a distributed SVDRecommender implementation? If
> not, how should i go about writing one? Can I use the new LanczosSolver to
> achieve this?
>
> Thanks,
> Sanjib
>