Posted to user@spark.apache.org by Nick Pentreath <ni...@gmail.com> on 2014/04/01 22:15:24 UTC

Re: possible bug in Spark's ALS implementation...

Hi Michael

Would you mind setting out exactly what differences you did find between
the Spark and Oryx implementations? It would be good to be clear on them, and
also to see if there are further tricks/enhancements from the Oryx one that
can be ported (such as the lambda * numRatings adjustment).

N


On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <ms...@allman.ms> wrote:

> I've been thoroughly investigating this issue over the past couple of days
> and have discovered quite a bit. For one thing, there is definitely (at
> least) one issue/bug in the Spark implementation that leads to incorrect
> results for models generated with rank > 1 or a large number of iterations.
> I will post a bug report with a thorough explanation this weekend or on
> Monday.
>
> I believe I've been able to track down every difference between the Spark
> and Oryx implementations that leads to different results. I made some
> adjustments to the Spark implementation so that, given the same initial
> product/item vectors, the resulting model is identical to the one produced
> by Oryx within a small numerical tolerance. I've verified this for small
> data sets and am working on verifying this with some large data sets.
>
> Aside from those already identified in this thread, another significant
> difference in the Spark implementation is that it begins the factorization
> process by computing the product matrix (Y) from the initial user matrix
> (X). Both of the papers on ALS referred to in this thread begin the process
> by computing the user matrix. I haven't done any testing comparing the
> models generated starting from Y or X, but they are very different. Is there
> a reason Spark begins the iteration by computing Y?
>
> Initializing both X and Y as is done in the Spark implementation seems
> unnecessary unless I'm overlooking some desired side-effect. Only the factor
> matrix which generates the other in the first iteration needs to be
> initialized.
>
> I also found that the product and user RDDs were being rebuilt many times
> over in my tests, even for tiny data sets. By persisting the RDD returned
> from updateFeatures() I was able to avoid a raft of duplicate computations.
> Is there a reason not to do this?
>
> Thanks.
>
>
>
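
To make the persist() suggestion above concrete, here is a rough Scala sketch
(names like runAls, updateUsers and updateProducts are made-up stand-ins for the
real updateFeatures() calls, not the actual MLlib code). It also shows why only
one of the two factor RDDs needs a random initialization:

  import org.apache.spark.rdd.RDD

  def runAls(initialUsers: RDD[(Int, Array[Double])],
             updateProducts: RDD[(Int, Array[Double])] => RDD[(Int, Array[Double])],
             updateUsers: RDD[(Int, Array[Double])] => RDD[(Int, Array[Double])],
             numIterations: Int): (RDD[(Int, Array[Double])], RDD[(Int, Array[Double])]) = {
    // Only X (users) is randomly initialized; Y (products) is derived from it below.
    var users: RDD[(Int, Array[Double])] = initialUsers.persist()
    var products: RDD[(Int, Array[Double])] = null
    for (_ <- 1 to numIterations) {
      val newProducts = updateProducts(users).persist()  // Y from the current X
      val newUsers = updateUsers(newProducts).persist()  // X from the new Y
      // Persisting the returned RDDs keeps later steps from recomputing the whole
      // lineage; unpersist the stale copies once they are no longer needed.
      if (products != null) products.unpersist()
      users.unpersist()
      products = newProducts
      users = newUsers
    }
    (users, products)
  }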

Re: possible bug in Spark's ALS implementation...

Posted by Sean Owen <so...@cloudera.com>.
It should be kept in mind that different implementations are rarely
strictly better, and that what works well on one type of data might
not on another. It also bears keeping in mind that several of these
differences just amount to different amounts of regularization, which
need not amount to a real difference.

#1 is just a question of 'semantics' really and not an algorithmic difference.

#4 is more subtle and I think it is a small positive to use weighted
regularization -- for all types of data. It is not specific to
explicit data since it is based on counts. That said, the effect of
this is mostly "more regularization across the board", and can be
simulated with a higher global lambda.

There might be something to #2. I suspect the QR decomposition is more
accurate and that can matter here. I have no evidence for that though.
(The symmetry wasn't actually an issue in the end, no?)
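
Purely to illustrate the accuracy point (a Breeze sketch, not what either
implementation actually does): forming an explicit inverse and multiplying is
generally less accurate than asking the library to solve the system directly.

  import breeze.linalg.{DenseMatrix, DenseVector, inv}

  // a is the normal-equations matrix (Y^T Y + reg * I), b the right-hand side.
  def viaExplicitInverse(a: DenseMatrix[Double], b: DenseVector[Double]): DenseVector[Double] =
    inv(a) * b   // forms a^-1 explicitly, then multiplies

  def viaDirectSolve(a: DenseMatrix[Double], b: DenseVector[Double]): DenseVector[Double] =
    a \ b        // solves a * x = b via a factorization, usually more accurate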

#3 won't affect the result so much as when to stop. That is, stopping
in a principled way is good, but it can be simulated by just
specifying X iterations.

If I were to experiment next I would focus on #2 and maybe #4.
--
Sean Owen | Director, Data Science | London


On Wed, Apr 2, 2014 at 9:06 AM, Michael Allman <ms...@allman.ms> wrote:
> Hi Nick,
>
> I don't have my Spark clone in front of me, but off the top of my head the major differences
> are/were:
>
> 1. Oryx multiplies lambda by alpha.
>
> 2. Oryx uses a different matrix inverse algorithm. It maintains a certain
> symmetry which the Spark algo does not; however, I don't think this
> difference has a real impact on the results.
>
> 3. Oryx supports the specification of a convergence threshold for
> termination of the algorithm, based on the delta RMSE on a subset of the
> training set, if I recall correctly. I've been using that as the
> termination criterion instead of a fixed number of iterations.
>
> 4. Oryx uses the weighted regularization scheme you alluded to earlier,
> multiplying lambda by the number of ratings.
>
> I've patched the Spark impl to support (4) but haven't pushed it to my
> clone on GitHub. I think it would be a valuable feature to support
> officially. I'd also like to work on (3) but don't have time now. I've
> only been using Oryx for the past couple of weeks.
>
> Cheers,
>
> Michael
>
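
A minimal sketch of a convergence-based stopping rule like the one described in
(3) might look like the following (plain Scala, illustrative names only, not the
Oryx code): keep iterating until the RMSE on some probe subset stops improving
by more than a threshold.

  // Run ALS sweeps until the probe RMSE improves by less than `eps`, or maxIters is hit.
  def runUntilConverged(step: () => Unit,        // one full ALS sweep (update Y, then X)
                        probeRmse: () => Double, // RMSE of the current model on a probe set
                        eps: Double,
                        maxIters: Int): Int = {
    var previous = Double.MaxValue
    var iters = 0
    var converged = false
    while (iters < maxIters && !converged) {
      step()
      val rmse = probeRmse()
      converged = previous - rmse < eps  // stop once the improvement is below eps
      previous = rmse
      iters += 1
    }
    iters
  }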

Re: possible bug in Spark's ALS implementation...

Posted by Nick Pentreath <ni...@gmail.com>.
The heuristic is to multiply lambda by the number of ratings (i.e., the number
of training examples for that user or item).

While the heuristic is used in CF settings, it is actually just penalizing, as
Sean said, more "complex" models, i.e. those where we have more
data/connections between objects. I would say this can apply in
collaborative filtering in both explicit and implicit settings, and should
be quite generic and applicable to most settings (e.g. it is similar to
per-term learning rates and regularization in linear models).

@Sean what do you think?

We could add a boolean parameter controlling whether to apply this
weighted regularization, with appropriate documentation as to why it's
there and what it's used for.
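
A sketch of what that option could look like for a single user's least-squares
step (Breeze is used purely for illustration; this is not the actual MLlib or
Oryx code). The only change is scaling lambda by that user's rating count:

  import breeze.linalg.{DenseMatrix, DenseVector}

  // y: factors of the items this user rated (n x k), r: the corresponding n ratings.
  // With weighted regularization, lambda is scaled by n, the number of ratings,
  // so heavily observed users/items are regularized more.
  def solveUser(y: DenseMatrix[Double], r: DenseVector[Double],
                lambda: Double, weighted: Boolean): DenseVector[Double] = {
    val n = y.rows
    val k = y.cols
    val reg = if (weighted) lambda * n else lambda
    val gram = y.t * y + DenseMatrix.eye[Double](k) * reg  // Y^T Y + reg * I
    gram \ (y.t * r)                                       // solve for this user's k factors
  }

Defaulting such a flag to false would keep the current behaviour unchanged while
letting rating-style datasets opt in.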


On Wed, Apr 2, 2014 at 9:19 AM, Debasish Das <de...@gmail.com>wrote:

> I think multiplying by the number of ratings is a heuristic that worked on
> rating-related problems like the Netflix dataset and other ratings datasets,
> but the scope of NMF is much broader than that...
>
> @Sean please correct me in case you don't agree...
>
> Definitely it's good to add all the rating-dataset-related optimizations, but
> they should not go in the core algorithm; rather in, say, a RatingALS that
> extends the core ALS...
