Posted to user@mahout.apache.org by Han JU <ju...@gmail.com> on 2013/03/18 17:39:24 UTC

ALS-WR on Million Song dataset

Hi,

I'm wondering whether someone has tried the ParallelALS with implicit
feedback job on the Million Song dataset? Any pointers on alpha and lambda?

In the paper alpha is 40 and lambda is 150, but I don't know what their r
values in the matrix are. They say r is based on the number of time units
that users have watched the show, so it may be large.

Many thanks!
--
JU Han

UTC   -  Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel

+33 0619608888

Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
Nice to hear that!

In order to get into the code, I suggest you first read the papers
regarding ALS for Collaborative Filtering:

Large-scale Parallel Collaborative Filtering for the Netflix Prize
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf

Collaborative Filtering for Implicit Feedback Datasets
http://research.yahoo.com/pub/2433

There's also a slide set from a university lecture at my department that
has a few slides about ALS:

http://de.slideshare.net/sscdotopen/latent-factor-models-for-collaborative-filtering

After that, you should try to go through
org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
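
For orientation while reading, a minimal sketch of how that job can be
invoked for an implicit-feedback dataset (paths and parameter values here
are illustrative, not taken from this thread):

  mahout parallelALS \
    --input /msd/triples \
    --output /msd/als/out \
    --tempDir /msd/als/tmp \
    --numFeatures 20 \
    --numIterations 10 \
    --lambda 0.065 \
    --implicitFeedback true \
    --alpha 40 \
    --numThreadsPerSolver 16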



Re: ALS-WR on Million Song dataset

Posted by Han JU <ju...@gmail.com>.
Hi Sebastian,

It runs much faster! On the same recommendation task it now terminates in
45min rather than ~2h yesterday. I think with further tuning it can be even
faster.

I'm trying to read the code; any hints for a good starting point?

Thanks a lot!



Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
Hi JU,

I reworked the RecommenderJob in a similar way to the ALS job. Can you
give it a try?

You have to apply the patch from
https://issues.apache.org/jira/browse/MAHOUT-1169

It introduces a new parameter for RecommenderJob called --numThreads. The
job should be configured similarly to the ALS job.
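
Assuming the output layout of the factorization job, an invocation might
then look like this (a sketch only; --numThreads as per the patch, paths
and values illustrative):

  mahout recommendfactorized \
    --input /als/out/userRatings/ \
    --userFeatures /als/out/U/ \
    --itemFeatures /als/out/M/ \
    --numRecommendations 500 \
    --maxRating 1 \
    --numThreads 8 \
    --output /als/recommendations/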

/s




Re: ALS-WR on Million Song dataset

Posted by Han JU <ju...@gmail.com>.
Thanks again Sebastian and Sean. I set -Xmx4000m for mapred.child.java.opts
and 8 threads for each mapper. Now the job runs smoothly and the whole
factorization ends in 45min. With your settings I think it should be even
faster.

One more thing is that the RecommenderJob is kind of slow (it computes
recommendations for all users). For example, I want a list of the top 500
items to recommend. Any pointers on how to modify the job code so that it
reads a file and then calculates recommendations only for the user IDs in
that file?
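
Until the job supports such a filter, one stopgap that needs no code change
is to filter the job's output down to the wanted user IDs. It does not save
any computation, but it yields the subset. A sketch, assuming the default
tab-separated "userID<TAB>recommendations" text output and a hypothetical
wanted-users.txt with one user ID per line:

  hadoop fs -cat /als/recommendations/part-* \
    | awk -F'\t' 'NR==FNR { want[$1]; next } $1 in want' wanted-users.txt - \
    > recommendations-subset.txt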



Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
I concur with everything you state. In an ideal world, we would have a
framework that offers a well-implemented hybrid hash join [1], one that
takes advantage of all available memory and gracefully spills to disk once
memory runs out, such as the one used by Stratosphere [2].

Best,
Sebastian

[1] Graefe, Bunker, Cooper: Hash Joins and Hash Teams in Microsoft SQL
Server
[2]
https://github.com/stratosphere-eu/stratosphere/blob/master/pact/pact-runtime/src/main/java/eu/stratosphere/pact/runtime/hash/MutableHashTable.java





Re: ALS-WR on Million Song dataset

Posted by Sean Owen <sr...@gmail.com>.
This highlights the 'cheating' going on here to make this run so fast, and
I completely endorse it. The join proceeds by side-loading half of the join
into memory. This creates a new potential bottleneck: the amount of memory
allocated to the mapper (or reducer).

There are a number of things that could be done later to optimize this,
though they might be kind of painful to implement. One is using floats
instead of doubles. The other is precomputing which items are needed in
which mappers and then only loading that subset into memory.
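
(For scale: the floats trick alone halves the footprint of the side-loaded
vectors, 4 bytes per value instead of 8.)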

In some of the development I have done, these and other tricks make these
things run fast. The downside is just that it takes effort to optimize that
far, and a lot of the optimization doesn't generalize to other algorithms.

I still quite buy the argument that there are going to be better frameworks
for this, where you don't fight and work around the paradigm to get things
done. But it's quite possible to make these things hum on the commodity
distributed paradigm of today.

(PS: anyone at Hadoop Summit Europe today, let me know)



Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
Hi JU,

the job creates an OpenIntObjectHashMap<Vector> holding the feature
vectors as DenseVectors. In one map job it is filled with the user-feature
vectors, in the next one with the item-feature vectors.

I used 4 gigabytes for a dataset with 1.8M users (using 20 features), so I
guess that 2-3 GB should be enough for your dataset.
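
As a back-of-the-envelope check: 1.8M user vectors x 20 features x 8 bytes
per double is roughly 290 MB of raw data; JVM object headers and the hash
map's own overhead can inflate that several-fold, which is why 4 GB leaves
comfortable headroom.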

I used these settings:

mapred.job.reuse.jvm.num.tasks=-1
mapred.tasktracker.map.tasks.maximum=1
mapred.child.java.opts=-Xmx4096m
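
These can also be passed per job on the command line, e.g. (a sketch; the
generic -D options must precede the job's own flags):

  mahout parallelALS \
    -Dmapred.job.reuse.jvm.num.tasks=-1 \
    -Dmapred.tasktracker.map.tasks.maximum=1 \
    -Dmapred.child.java.opts=-Xmx4096m \
    --input ... --output ... --numThreadsPerSolver 16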



Re: ALS-WR on Million Song dataset

Posted by Han JU <ju...@gmail.com>.
Hi Sebastian,

I've tried the svn trunk. Hadoop constantly complains about memory, with
errors like "out of memory error".
On the datanode there are 4 physical cores, and with hyper-threading it has
16 logical cores, so I set --numThreadsPerSolver to 16; that seems to cause
a problem with memory.
How do you set your mapred.child.java.opts? Given that we allow only one
mapper, should that be nearly the whole system memory?

Thanks!



Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
Hi JU,

We recently rewrote the factorization code; it should be much faster now.
You should use the current trunk, make Hadoop schedule only one mapper per
machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make it reuse the
JVMs, and add the parameter --numThreadsPerSolver with the number of cores
that you want to use per machine (use all if you can).

I got astonishing results running the code like this on a 26-machine
cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
dataset (700M datapoints).

Let me know if you need more information.

Best,
Sebastian



Re: ALS-WR on Million Song dataset

Posted by Han JU <ju...@gmail.com>.
Thanks Sebastian and Sean, I will dig more into the paper.
With a simple try on a small part of the data, it seems a larger alpha
(~40) gets me a better result.
Do you have an idea how long ParallelALS will take for the complete 700 MB
dataset? It contains ~48 million triples. The Hadoop cluster at my disposal
has 5 nodes and can factorize the MovieLens 10M dataset in about 13min.



Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@apache.org>.
You should also be aware that the alpha parameter comes from a formula
the authors introduce to measure the "confidence" in the observed values:

confidence = 1 + alpha * observed_value

You can also change that formula in the code to something that you see as
a better fit; the paper even suggests alternative variants.
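
For context, the confidence term plugs into the paper's overall objective,
with r_ui the observed value and x_u, y_i the user and item factors:

  \min_{X,Y} \sum_{u,i} c_{ui} \, (p_{ui} - x_u^\top y_i)^2
      + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

  p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases},
  \qquad c_{ui} = 1 + \alpha \, r_{ui}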

Best,
Sebastian


On 18.03.2013 18:06, Han JU wrote:
> Thanks for quick responses.
> 
> Yes, it's that dataset. What I'm using is triplets of "user_id song_id
> play_times", for ~1M users. No audio features, just plain-text triples.
> 
> It seems to me that the "implicit feedback" paper matches this dataset
> well: no explicit ratings, but counts of listens per song.
> 
> Thank you Sean for the alpha value; I think they use big numbers because
> the values in their R matrix are big.
> 
> 
> 2013/3/18 Sebastian Schelter <ss...@googlemail.com>
> 
>> JU,
>>
>> are you referring to this dataset?
>>
>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>
>> On 18.03.2013 17:47, Sean Owen wrote:
>>> One word of caution: there are at least two papers on ALS, and they
>>> define lambda differently. I think you are talking about "Collaborative
>>> Filtering for Implicit Feedback Datasets".
>>>
>>> I've been working with some folks who point out that alpha=40 seems to
>>> be too high for most data sets. After running some tests on common data
>>> sets, alpha=1 looks much better. YMMV.
>>>
>>> In the end you have to evaluate these two parameters, and the # of
>>> features, across a range to determine what's best.
>>>
>>> Is this data set not a bunch of audio features? I am not sure it works
>>> for ALS, not naturally at least.
>>>
>>>
>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm wondering whether someone has tried the ParallelALS implicit
>>>> feedback job on the million song dataset? Any pointers on alpha and
>>>> lambda?
>>>>
>>>> In the paper alpha is 40 and lambda is 150, but I don't know what their
>>>> r values in the matrix are. They said it is based on how long users
>>>> have watched the show, so maybe it's big.
>>>>
>>>> Many thanks!
>>>> --
>>>> *JU Han*
>>>>
>>>> UTC   -  Université de Technologie de Compiègne
>>>> *     **GI06 - Fouille de Données et Décisionnel*
>>>>
>>>> +33 0619608888
>>>>
>>>
>>
>>
> 
> 


Re: ALS-WR on Million Song dataset

Posted by Sean Owen <sr...@gmail.com>.
Yes, that's fine input then.

Large alpha should go with small R values, not large R. Really, alpha
controls how much observed input (R != 0) is weighted towards 1 versus how
much unobserved input (R = 0) is weighted towards 0. I scale lambda by
alpha to complete this effect.
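
To spell that weighting out, this is the cost function from the implicit
feedback paper, in its own notation (r_ui is the raw observed value, e.g.
a play count; x_u and y_i are the user and item factor vectors):

    % p_{ui} = 1 if r_{ui} > 0, else 0      (binary preference)
    % c_{ui} = 1 + \alpha r_{ui}            (confidence weight)
    \min_{X,Y} \sum_{u,i} c_{ui} \, (p_{ui} - x_u^\top y_i)^2
      + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

Reading Sean's scaling remark in these terms, replacing \lambda with
\lambda \alpha keeps the regularizer comparable as alpha grows; that
interpretation is ours, not a statement from the paper.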


On Mon, Mar 18, 2013 at 1:06 PM, Han JU <ju...@gmail.com> wrote:

> Thanks for the quick responses.
>
> Yes, it's that dataset. What I'm using is triplets of "user_id song_id
> play_times" for ~1m users. No audio features, just plain text triples.
>
> It seems to me that the paper about "implicit feedback" matches this
> dataset well: no explicit ratings, but the number of times a song was
> listened to.
>
> Thank you Sean for the alpha value. I think they use big numbers because
> their values in the R matrix are big.
>
>
> 2013/3/18 Sebastian Schelter <ss...@googlemail.com>
>
> > JU,
> >
> > are you referring to this dataset?
> >
> > http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >
> > On 18.03.2013 17:47, Sean Owen wrote:
> > > One word of caution: there are at least two papers on ALS, and they
> > > define lambda differently. I think you are talking about "Collaborative
> > > Filtering for Implicit Feedback Datasets".
> > >
> > > I've been working with some folks who point out that alpha=40 seems
> > > to be too high for most data sets. After running some tests on common
> > > data sets, alpha=1 looks much better. YMMV.
> > >
> > > In the end you have to evaluate these two parameters, and the # of
> > > features, across a range to determine what's best.
> > >
> > > Is this data set not a bunch of audio features? I am not sure it
> > > works for ALS, not naturally at least.
> > >
> > >
> > > On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju...@gmail.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm wondering whether someone has tried the ParallelALS implicit
> > >> feedback job on the million song dataset? Any pointers on alpha and
> > >> lambda?
> > >>
> > >> In the paper alpha is 40 and lambda is 150, but I don't know what
> > >> their r values in the matrix are. They said it is based on how long
> > >> users have watched the show, so maybe it's big.
> > >>
> > >> Many thanks!
> > >> --
> > >> *JU Han*
> > >>
> > >> UTC   -  Université de Technologie de Compiègne
> > >> *     **GI06 - Fouille de Données et Décisionnel*
> > >>
> > >> +33 0619608888
> > >>
> > >
> >
> >
>
>
> --
> *JU Han*
>
> Software Engineer Intern @ KXEN Inc.
> UTC   -  Université de Technologie de Compiègne
> *     **GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888
>

Re: ALS-WR on Million Song dataset

Posted by Han JU <ju...@gmail.com>.
Thanks for the quick responses.

Yes, it's that dataset. What I'm using is triplets of "user_id song_id
play_times" for ~1m users. No audio features, just plain text triples.

It seems to me that the paper about "implicit feedback" matches this
dataset well: no explicit ratings, but the number of times a song was
listened to.

Thank you Sean for the alpha value. I think they use big numbers because
their values in the R matrix are big.
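
(A note for anyone reproducing this: Mahout's ParallelALSFactorizationJob
reads plain text lines of the form userID,itemID,value, so the Taste
Profile triplets need their string user and song IDs mapped to numeric IDs
first. A hypothetical converted sample, with made-up IDs, one
user,song,playCount triple per line:

    1,42,3
    1,137,12
    2,42,1

With --implicitFeedback enabled, the play count is treated as the observed
value feeding the confidence weighting rather than as an explicit rating.)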


2013/3/18 Sebastian Schelter <ss...@googlemail.com>

> JU,
>
> are you referring to this dataset?
>
> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>
> On 18.03.2013 17:47, Sean Owen wrote:
> > One word of caution: there are at least two papers on ALS, and they
> > define lambda differently. I think you are talking about "Collaborative
> > Filtering for Implicit Feedback Datasets".
> >
> > I've been working with some folks who point out that alpha=40 seems to be
> > too high for most data sets. After running some tests on common data
> > sets, alpha=1 looks much better. YMMV.
> >
> > In the end you have to evaluate these two parameters, and the # of
> > features, across a range to determine what's best.
> >
> > Is this data set not a bunch of audio features? I am not sure it works
> > for ALS, not naturally at least.
> >
> >
> > On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I'm wondering whether someone has tried the ParallelALS implicit
> >> feedback job on the million song dataset? Any pointers on alpha and
> >> lambda?
> >>
> >> In the paper alpha is 40 and lambda is 150, but I don't know what their
> >> r values in the matrix are. They said it is based on how long users
> >> have watched the show, so maybe it's big.
> >>
> >> Many thanks!
> >> --
> >> *JU Han*
> >>
> >> UTC   -  Université de Technologie de Compiègne
> >> *     **GI06 - Fouille de Données et Décisionnel*
> >>
> >> +33 0619608888
> >>
> >
>
>


-- 
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC   -  Université de Technologie de Compiègne
*     **GI06 - Fouille de Données et Décisionnel*

+33 0619608888

Re: ALS-WR on Million Song dataset

Posted by Sebastian Schelter <ss...@googlemail.com>.
JU,

are you referring to this dataset?

http://labrosa.ee.columbia.edu/millionsong/tasteprofile

On 18.03.2013 17:47, Sean Owen wrote:
> One word of caution: there are at least two papers on ALS, and they
> define lambda differently. I think you are talking about "Collaborative
> Filtering for Implicit Feedback Datasets".
> 
> I've been working with some folks who point out that alpha=40 seems to be
> too high for most data sets. After running some tests on common data sets,
> alpha=1 looks much better. YMMV.
> 
> In the end you have to evaluate these two parameters, and the # of
> features, across a range to determine what's best.
> 
> Is this data set not a bunch of audio features? I am not sure it works for
> ALS, not naturally at least.
> 
> 
> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju...@gmail.com> wrote:
> 
>> Hi,
>>
>> I'm wondering whether someone has tried the ParallelALS implicit feedback
>> job on the million song dataset? Any pointers on alpha and lambda?
>>
>> In the paper alpha is 40 and lambda is 150, but I don't know what their
>> r values in the matrix are. They said it is based on how long users have
>> watched the show, so maybe it's big.
>>
>> Many thanks!
>> --
>> *JU Han*
>>
>> UTC   -  Université de Technologie de Compiègne
>> *     **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888
>>
> 


Re: ALS-WR on Million Song dataset

Posted by Sean Owen <sr...@gmail.com>.
One word of caution: there are at least two papers on ALS, and they
define lambda differently. I think you are talking about "Collaborative
Filtering for Implicit Feedback Datasets".

I've been working with some folks who point out that alpha=40 seems to be
too high for most data sets. After running some tests on common data sets,
alpha=1 looks much better. YMMV.

In the end you have to evaluate these two parameters, and the # of
features, across a range to determine what's best; a sketch of such a
sweep follows below.

Is this data set not a bunch of audio features? I am not sure it works for
ALS, not naturally at least.
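
A minimal sketch of such a sweep, assuming the flag names of Mahout's
ParallelALSFactorizationJob (verify against the version you run); the
paths and the grid values are made up, and the evaluation of each
factorization is deliberately left abstract:

    // Hypothetical sweep over alpha and lambda for implicit-feedback ALS.
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    public class ParameterSweep {
      public static void main(String[] args) throws Exception {
        double[] alphas = { 1, 10, 40 };
        double[] lambdas = { 0.01, 0.065, 0.1 };
        for (double alpha : alphas) {
          for (double lambda : lambdas) {
            String output = "/sweep/als-a" + alpha + "-l" + lambda;
            ToolRunner.run(new ParallelALSFactorizationJob(), new String[] {
                "--input", "/sweep/triplets",
                "--output", output,
                "--implicitFeedback", "true",
                "--alpha", String.valueOf(alpha),
                "--lambda", String.valueOf(lambda),
                "--numFeatures", "20",
                "--numIterations", "10" });
            // ...score this (alpha, lambda) pair on held-out interactions,
            // e.g. with mean percentile rank as in the implicit feedback
            // paper, and keep the best-scoring factorization...
          }
        }
      }
    }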


On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju...@gmail.com> wrote:

> Hi,
>
> I'm wondering whether someone has tried the ParallelALS implicit feedback
> job on the million song dataset? Any pointers on alpha and lambda?
>
> In the paper alpha is 40 and lambda is 150, but I don't know what their
> r values in the matrix are. They said it is based on how long users have
> watched the show, so maybe it's big.
>
> Many thanks!
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> *     **GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888
>