Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2011/11/18 14:24:29 UTC

lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

Sebastian do you have any thoughts on the right starting value for lambda,
the overfitting param in your ALS-based implementation? Yes I'm looking at
the same Koren paper you had mentioned.

I don't have a good sense of whether the loss from that extra term is
supposed to be "much more important", "as important as", or "much less
important" than that modified least-squares loss function.
That is -- is it generally near 1, or not?
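(For reference, the cost function as I read it from that paper -- assuming the
implicit-feedback formulation with confidence weights c_ui and preferences
p_ui -- is roughly

  \min_{X,Y} \sum_{u,i} c_{ui} (p_{ui} - x_u^\top y_i)^2
             + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

so the question is really how large the lambda term ends up being relative to
the confidence-weighted squared error.)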

Sean

Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

Posted by Sebastian Schelter <ss...@apache.org>.
The right value for lambda depends on the data and the confidence
function and should be chosen via cross-validation.

Coincidentally, I'm currently watching lecture X (10) of
http://ml-class.org, which talks about exactly this: ways to choose the
regularization parameter :)
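
For what it's worth, here is a bare-bones sketch of such a sweep. The
trainAndScore() hook is purely hypothetical -- it stands in for whatever ALS
training and hold-out scoring you have wired up, not for the actual
ParallelALSFactorizationJob API:

public class LambdaSweep {

  public static void main(String[] args) {
    double bestLambda = Double.NaN;
    double bestRmse = Double.POSITIVE_INFINITY;
    // candidate lambdas spaced on a rough log scale
    for (double lambda : new double[] {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}) {
      double rmse = trainAndScore(lambda); // hypothetical hook, see below
      if (rmse < bestRmse) {
        bestRmse = rmse;
        bestLambda = lambda;
      }
    }
    System.out.println("best lambda on the hold-out set: " + bestLambda);
  }

  // Placeholder: run ALS with the given lambda on the training split and
  // return RMSE (or whatever metric you prefer) on the held-out split.
  static double trainAndScore(double lambda) {
    throw new UnsupportedOperationException("plug in ALS training + hold-out scoring");
  }
}

Sweeping on a log scale first and then refining around the best value is
usually good enough in practice.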


--sebastian

On 18.11.2011 14:24, Sean Owen wrote:
> Sebastian do you have any thoughts on the right starting value for lambda,
> the overfitting param in your ALS-based implementation? Yes I'm looking at
> the same Koren paper you had mentioned.
> 
> I don't have a good sense of whether the loss from that extra term is
> supposed to be "much more important", "as important as", or "much less
> important" than that modified least-squares loss function.
> That is -- is it generally near 1, or not?
> 
> Sean
> 


Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

Posted by Ted Dunning <te...@gmail.com>.
With decompositions like X * Y', there are a variety of quality measures
that you can impose.  For ALS, the quality measure is least-squares
reconstruction of the observed cells of A.  As such, X and Y tend to be
orthogonal, but not both orthonormal, since they carry the information
that an SVD would keep in S.
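
A quick way to see how close the factors come to that is to look at Y'Y
directly. A toy check in plain Java (nothing Mahout-specific, made-up
numbers):

public class GramCheck {

  // G = Y^T * Y for an n x k item-feature matrix Y
  static double[][] gram(double[][] y) {
    int k = y[0].length;
    double[][] g = new double[k][k];
    for (double[] row : y) {
      for (int a = 0; a < k; a++) {
        for (int b = 0; b < k; b++) {
          g[a][b] += row[a] * row[b];
        }
      }
    }
    return g;
  }

  public static void main(String[] args) {
    // toy item-feature matrix; in practice Y would come out of the ALS run
    double[][] y = { {1.2, 0.1}, {0.9, -0.3}, {1.1, 0.4} };
    double[][] g = gram(y);
    double diag = 0, offDiag = 0;
    for (int a = 0; a < g.length; a++) {
      for (int b = 0; b < g.length; b++) {
        if (a == b) {
          diag += Math.abs(g[a][b]);
        } else {
          offDiag += Math.abs(g[a][b]);
        }
      }
    }
    System.out.println("off-diagonal vs diagonal mass of Y'Y: " + (offDiag / diag));
    // For the V of an SVD this ratio would be ~0 with a diagonal of ones;
    // ALS factors are only approximately orthogonal and carry the scale
    // that an SVD would put in S.
  }
}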

On Sat, Nov 26, 2011 at 8:53 AM, Sean Owen <sr...@gmail.com> wrote:

> I have a follow-on question about the alternating least-squares method.
>
> I understand it's approximating A = X * YT.
>
> In an SVD, where A = U * S * VT, U and V have orthonormal columns, so
> VT * V = I for example, and V projects A into the user-feature space
> (leaving aside S for the moment).
>
> I don't know enough to understand whether the same is true of YT in
> the simpler method. It stands to reason that the rows of YT (features)
> ought to tend to be orthogonal if the process does its job, so YT * Y
> ought to be about the identity matrix, etc.
>
> Is this so, or, can somebody name the basic piece of theory I can read
> to understand this one better?
>
>
> On Fri, Nov 18, 2011 at 1:24 PM, Sean Owen <sr...@gmail.com> wrote:
> > Sebastian do you have any thoughts on the right starting value for
> lambda,
> > the overfitting param in your ALS-based implementation? Yes I'm looking
> at
> > the same Koren paper you had mentioned.
> > I don't have a good sense of whether the loss from that extra term is
> > supposed to be "much more important", "as important as", or "much less
> > important" than that modified least-squares loss function.
> > That is -- is it generally near 1, or not?
> > Sean
>

Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

Posted by Sean Owen <sr...@gmail.com>.
I have a follow-on question about the alternating least-squares method.

I understand it's approximating A = X * YT.

In an SVD, where A = U * S * VT, U and V have orthonormal columns, so
VT * V = I for example, and V projects A into the user-feature space
(leaving aside S for the moment).

I don't know enough to understand whether the same is true of YT in
the simpler method. It stands to reason that the rows of YT (features)
ought to tend to be orthogonal if the process does its job, so YT * Y
ought to be about the identity matrix, etc.

Is this so, or, can somebody name the basic piece of theory I can read
to understand this one better?


On Fri, Nov 18, 2011 at 1:24 PM, Sean Owen <sr...@gmail.com> wrote:
> Sebastian do you have any thoughts on the right starting value for lambda,
> the overfitting param in your ALS-based implementation? Yes I'm looking at
> the same Koren paper you had mentioned.
> I don't have a good sense of whether the loss from that extra term is
> supposed to be "much more important", "as important as", or "much less
> important" than that modified least-squares loss function.
> That is -- is it generally near 1, or not?
> Sean

Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
In my experience, 'much less important' is probably the most accurate
description of the three.
Noisy data will usually result in a cross-validation optimum at higher
lambda values, and vice versa (speaking from SGD experience and my
specific data).

In the general case, you could probably try to infer it via bisection
(which many people do manually). Mahout's online SGD uses adaptive
step-recorded convex optimization for gamma and lambda, AFAIK, but that
kind of online adaptivity is probably not possible with distributed ALS,
so n-way bisection iterations are probably the most promising approach.
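
A rough sketch of that n-way bisect idea in plain Java -- the evalRmse()
stand-in is hypothetical, substitute your actual ALS training plus hold-out
scoring:

public class LambdaBisect {

  // Stand-in for "train ALS with this lambda and compute hold-out RMSE".
  // Here it is just a toy curve with a minimum near lambda = 0.04 so the
  // search has something to run against.
  static double evalRmse(double lambda) {
    double d = Math.log(lambda) - Math.log(0.04);
    return 0.9 + d * d;
  }

  // n-way search on a log scale: evaluate `points` lambdas between lo and
  // hi, keep the best one, shrink the bracket around it, and repeat.
  static double nWaySearch(double lo, double hi, int points, int rounds) {
    double bestLambda = lo;
    for (int r = 0; r < rounds; r++) {
      double logLo = Math.log(lo);
      double logHi = Math.log(hi);
      double step = (logHi - logLo) / (points - 1);
      double bestRmse = Double.POSITIVE_INFINITY;
      for (int i = 0; i < points; i++) {
        double lambda = Math.exp(logLo + i * step);
        double rmse = evalRmse(lambda);
        if (rmse < bestRmse) {
          bestRmse = rmse;
          bestLambda = lambda;
        }
      }
      // narrow to one grid step on either side of the current best
      lo = Math.exp(Math.log(bestLambda) - step);
      hi = Math.exp(Math.log(bestLambda) + step);
    }
    return bestLambda;
  }

  public static void main(String[] args) {
    System.out.println("best lambda found: " + nWaySearch(1e-4, 10.0, 5, 4));
  }
}

Each evaluation is a full ALS run, so a handful of points and a couple of
rounds is probably as much as anyone would want to do.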

I'd venture to say 1 is quite a big number, assuming features are scaled.

On Fri, Nov 18, 2011 at 5:24 AM, Sean Owen <sr...@gmail.com> wrote:
> Sebastian do you have any thoughts on the right starting value for lambda,
> the overfitting param in your ALS-based implementation? Yes I'm looking at
> the same Koren paper you had mentioned.
>
> I don't have a good sense of whether the loss from that extra term is
> supposed to be "much more important", "as important as", or "much less
> important" than that modified least-squares loss function.
> That is -- is it generally near 1, or not?
>
> Sean
>