Posted to user@mahout.apache.org by Dominik Huebner <co...@dhuebner.com> on 2013/03/25 03:19:33 UTC

Mathematical background of ALS recommenders

It's quite hard for me to grasp the mathematical concepts behind the ALS
recommenders. It would be great if someone could help me figure out
the details. This is my current understanding:

1. The item-feature (M) matrix is initialized using the average ratings
and random values (explicit case).

2. The user-feature (U) matrix is solved for using the partial derivative of
the error function with respect to u_i (the columns or row vectors of U).

Suppose we use as many features as there are items and the error
function does not use any regularization. Would U be solved within the
first iteration? If not, I do not understand why more than one iteration
is needed.
Furthermore, I believe I have understood that using fewer features than
items, together with regularization, does not allow U to be solved in a
way that the stopping criterion can be met after only one iteration.
Thus, iteration is required to gradually converge to the stopping
criterion.

I hope I have pointed out my problems clearly enough.


Re: Mathematical background of ALS recommenders

Posted by Koobas <ko...@gmail.com>.
You managed to ask the question in a convoluted way.
I am not sure whether you don't understand the principles or the intricacies.
The high-level answer is the following.
ALS is just like SVD, only it is not SVD.
It produces a low-rank approximation of the user-item matrix.
I.e., it represents the user-item matrix as a product of two smaller matrices,
item-feature times feature-user.
It does that by initializing them to some random junk,
and then repeatedly solving the least squares problem.
It sets one matrix, solves for the other, sets the other, solves for the first.
It converges rapidly.
For the MovieLens data sets, it's like three iterations and you're done.
Six to be on the safe side.
Ten is overkill.
I don't see why you would expect it to be done in one iteration, though.
You are starting with random matrices. You have to iterate.
The way I see it, regularization is just good practice.
It's there to prevent over-fitting, which I don't think is an issue
unless you are using a very high number of features,
basically approaching the number of users or items.
Apply the textbook regularization with lambda=1 and the problem is
gone.
You can have more features than users or items, and your user-item matrix
will still be reconstructed correctly.
You can also keep iterating as much as you want. The problem is gone.
At least for the matrices I tried - MovieLens (implicit feedback - no
ratings).
Not sure my answer is helpful, but just giving it a shot.
I am sure others will chip in.
Koobas
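
(A minimal sketch of the alternating loop described above, for the explicit-ratings
case: plain NumPy, not Mahout's implementation. Zero is treated as "unrated", the
initialization is plain random, and the regularization term is scaled by the number
of ratings as in the ALS-WR paper; all names and defaults here are illustrative.)

import numpy as np

def als(R, k=10, lam=0.05, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))   # user-feature matrix
    M = rng.normal(scale=0.1, size=(n_items, k))   # item-feature matrix
    observed = R != 0                              # which cells are actually rated
    for _ in range(iterations):
        # Fix M and solve a small regularized least-squares problem per user ...
        for u in range(n_users):
            idx = observed[u]
            if idx.any():
                M_u = M[idx]
                A = M_u.T @ M_u + lam * idx.sum() * np.eye(k)
                U[u] = np.linalg.solve(A, M_u.T @ R[u, idx])
        # ... then fix U and solve per item.
        for i in range(n_items):
            idx = observed[:, i]
            if idx.any():
                U_i = U[idx]
                A = U_i.T @ U_i + lam * idx.sum() * np.eye(k)
                M[i] = np.linalg.solve(A, U_i.T @ R[idx, i])
    return U, M

Each inner solve is only a k-by-k linear system, done independently per user or per
item, which is what makes the step easy to parallelize.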


On Sun, Mar 24, 2013 at 10:19 PM, Dominik Huebner <co...@dhuebner.com>wrote:

> It's quite hard for me to get the mathematical concepts of the ALS
> recommenders. It would be great if someone could help me to figure out
> the details. This is my current status:
>
> 1. The item-feature (M) matrix is initialized using the average ratings
> and random values (explicit case)
>
> 2. The user-feature (U) matrix is solved using the partial derivative of
> the error function with respect to u_i (the columns of row-vectors of U)
>
> Supposed we use as many features as items are known and the error
> function does not use any regularization. Would U be solved within the
> first iteration? If not, I do not understand why more than one iteration
> is needed.
> Furthermore, I believe to have understood that using fewer features than
> items and also applying regularization, does not allow to solve U in a
> way that the stopping criterion can be met after only one iteration.
> Thus, iteration is required to gradually converge to the stopping
> criterion.
>
> I hope I have pointed out my problems clearly enough.
>
>

Re: Mathematical background of ALS recommenders

Posted by Abhijith CHandraprabhu <ab...@gmail.com>.
ALS is a way of solving a matrix equivalent of a regular linear least
squares problem. In a regular least squares problem we solve Ax = b, i.e.,
we minimize ||b - Ax|| in the l2 norm. In the case of ALS we try to minimize
||R - UM|| in the Frobenius norm. Now we reduce ||R - UM|| to problems of
the kind ||b - Ax|| by solving for each row of U in a regular least squares
sense.

Example:
Let us consider the first row of R, which holds all the ratings given by the
first user. Let us assume the user has seen movies 1, 213, 315, 1744, 2133,
7044 and 12344, so 7 movies in total. Collecting the first user's 7 known
ratings into a vector, let this be denoted by b.
We initialize the M matrix in a particular way; it has n_m (total number of
movies) rows and n_f (reduced number of factors) columns. The matrix U has
n_u (total number of users) rows and n_f columns. For the case of the first
user we can form a sub-matrix of M, denote this by M_u. This matrix M_u has
all the rows from M corresponding to the movies seen by user u, the first
user in this case. Let u be the row vector from U corresponding to the
considered user, i.e., the first user. Now we have three components b, M_u
and u, so we have a regular least squares problem ||b - M_u u'||. This is
done for all the users, hence forming the first estimate of the U matrix.
Since this can't be the final U, we need to continue to alternately solve
for U and M.
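
(As a toy sketch of that single-user step -- the matrix sizes, movie indices and
ratings below are made up for illustration:)

import numpy as np

rng = np.random.default_rng(1)
n_m, n_f = 20000, 5
M = rng.normal(size=(n_m, n_f))                    # item-feature matrix, n_m x n_f

seen = [1, 213, 315, 1744, 2133, 7044, 12344]      # movies rated by the first user
b = np.array([4.0, 5.0, 3.0, 2.0, 5.0, 4.0, 1.0])  # that user's 7 known ratings

M_u = M[seen]                                      # 7 x n_f sub-matrix of M
u, *_ = np.linalg.lstsq(M_u, b, rcond=None)        # least squares solution of M_u u ~ b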

You could try to just do an SVD on R, initialize M = S*V', and use this M
in ALS; in this case you might not need many iterations, as we don't
initialize M with random values. Any approximation method involving
random initial values needs a certain minimum number of iterations to tend
towards the right values, as each iteration only reduces the error to a
certain extent. We can't reduce the whole error in just one iteration.

I am also new to ALS, this is my initial understanding.



On Mon, Mar 25, 2013 at 6:47 AM, Ted Dunning <te...@gmail.com> wrote:

> If you have an observation matrix that actually is low rank and you start
> with the exact value of U, then one iteration on M will suffice to solve
> the system.
>
> Likewise, if you have M, one iteration on U will suffice.
>
> This holds regardless of the number of features of M or U (which must
> match, btw) as long as the rank of the observation matrix is equal or less
> to that number of features.
>
> Otherwise, more than one iteration is likely to be required.
>
> On Mon, Mar 25, 2013 at 3:19 AM, Dominik Huebner <contact@dhuebner.com
> >wrote:
>
> > It's quite hard for me to get the mathematical concepts of the ALS
> > recommenders. It would be great if someone could help me to figure out
> > the details. This is my current status:
> >
> > 1. The item-feature (M) matrix is initialized using the average ratings
> > and random values (explicit case)
> >
> > 2. The user-feature (U) matrix is solved using the partial derivative of
> > the error function with respect to u_i (the columns of row-vectors of U)
> >
> > Supposed we use as many features as items are known and the error
> > function does not use any regularization. Would U be solved within the
> > first iteration? If not, I do not understand why more than one iteration
> > is needed.
> > Furthermore, I believe to have understood that using fewer features than
> > items and also applying regularization, does not allow to solve U in a
> > way that the stopping criterion can be met after only one iteration.
> > Thus, iteration is required to gradually converge to the stopping
> > criterion.
> >
> > I hope I have pointed out my problems clearly enough.
> >
> >
>



-- 
Best regards,
Abhijith

Re: Mathematical background of ALS recommenders

Posted by Ted Dunning <te...@gmail.com>.
If you have an observation matrix that actually is low rank and you start
with the exact value of U, then one iteration on M will suffice to solve
the system.

Likewise, if you have M, one iteration on U will suffice.

This holds regardless of the number of features of M or U (which must
match, btw) as long as the rank of the observation matrix is equal or less
to that number of features.

Otherwise, more than one iteration is likely to be required.
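
(A quick toy check of this with a fully observed, genuinely rank-k matrix -- all
numbers made up:)

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 40, 5
U = rng.normal(size=(m, k))
M = rng.normal(size=(n, k))
R = U @ M.T                                    # an observation matrix that really is rank k

# With the exact U in hand, a single least-squares solve recovers M -- no iteration.
M_hat, *_ = np.linalg.lstsq(U, R, rcond=None)  # M_hat is k x n
print(np.allclose(R, U @ M_hat))               # True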

On Mon, Mar 25, 2013 at 3:19 AM, Dominik Huebner <co...@dhuebner.com>wrote:

> It's quite hard for me to get the mathematical concepts of the ALS
> recommenders. It would be great if someone could help me to figure out
> the details. This is my current status:
>
> 1. The item-feature (M) matrix is initialized using the average ratings
> and random values (explicit case)
>
> 2. The user-feature (U) matrix is solved using the partial derivative of
> the error function with respect to u_i (the columns of row-vectors of U)
>
> Supposed we use as many features as items are known and the error
> function does not use any regularization. Would U be solved within the
> first iteration? If not, I do not understand why more than one iteration
> is needed.
> Furthermore, I believe to have understood that using fewer features than
> items and also applying regularization, does not allow to solve U in a
> way that the stopping criterion can be met after only one iteration.
> Thus, iteration is required to gradually converge to the stopping
> criterion.
>
> I hope I have pointed out my problems clearly enough.
>
>

Re: Mathematical background of ALS recommenders

Posted by Ted Dunning <te...@gmail.com>.
No.  We don't.  We used to use Lanczos, but that has improved.

On Mon, Mar 25, 2013 at 4:43 PM, Abhijith CHandraprabhu <abhijithc@gmail.com
> wrote:

> Sorry, I actually meant svds(sparse SVD). I think in mahout they use
> Lanczos also.
>
>
> On Mon, Mar 25, 2013 at 4:25 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Yes.  But SSVD != Lanczos.  Lanczos is sequential, one vector at a time, like
> > you said.  SSVD does all the vectors in one go.  That one go requires a
> few
> > steps, but does not require O(k) iterations.
> >
> > On Mon, Mar 25, 2013 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > OK, the 'k iterations' happen inline in one job? I thought the Lanczos
> > > algorithm found the k eigenvalues/vectors one after the other. Yeah I
> > > suppose that doesn't literally mean k map/reduce jobs. Yes the broader
> > > idea was whether or not you might get something useful out of ALS
> > > earlier.
> > >
> > > On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > > SVD need not be iterative at all. The SSVD code uses roughly 5
> > > map-reduces
> > > > to give you a high quality SVD approximation.  There is the option to
> > add
> > > > 0, 1 or more extra iterations, but it is rare to need more than 1.
> > > >
> > > ALS could well be of use after less work.  This is especially true for
> > > > incremental solutions.
> > >
> >
>
>
>
> --
> Best regards,
> Abhijith
>

Re: Mathematical background of ALS recommenders

Posted by Abhijith CHandraprabhu <ab...@gmail.com>.
Sorry, I actually meant svds (sparse SVD). I think in Mahout they use
Lanczos also.


On Mon, Mar 25, 2013 at 4:25 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes.  But SSVD != Lanczos.  Lanczos is sequential, one vector at a time, like
> you said.  SSVD does all the vectors in one go.  That one go requires a few
> steps, but does not require O(k) iterations.
>
> On Mon, Mar 25, 2013 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > OK, the 'k iterations' happen inline in one job? I thought the Lanczos
> > algorithm found the k eigenvalues/vectors one after the other. Yeah I
> > suppose that doesn't literally mean k map/reduce jobs. Yes the broader
> > idea was whether or not you might get something useful out of ALS
> > earlier.
> >
> > On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > SVD need not be iterative at all. The SSVD code uses roughly 5
> > map-reduces
> > > to give you a high quality SVD approximation.  There is the option to
> add
> > > 0, 1 or more extra iterations, but it is rare to need more than 1.
> > >
> > > ALS could well be of use after less work.  This is especially true for
> > > incremental solutions.
> >
>



-- 
Best regards,
Abhijith

Re: Mathematical background of ALS recommenders

Posted by Ted Dunning <te...@gmail.com>.
Yes.  But SSVD != Lanczos.  Lanczos is sequential, one vector at a time, like
you said.  SSVD does all the vectors in one go.  That one go requires a few
steps, but does not require O(k) iterations.
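
(Roughly, the stochastic projection idea behind SSVD looks like this on a single
machine -- a Halko-style sketch in NumPy, not the actual Mahout map-reduce code;
the oversampling and the single power iteration are just common defaults:)

import numpy as np

def stochastic_svd(A, k, oversample=10, power_iters=1, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))   # random test matrix
    Y = A @ Omega                                  # sample the range of A in one go
    for _ in range(power_iters):                   # optional extra iteration(s)
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for the sample
    B = Q.T @ A                                    # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]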

On Mon, Mar 25, 2013 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:

> OK, the 'k iterations' happen inline in one job? I thought the Lanczos
> algorithm found the k eigenvalues/vectors one after the other. Yeah I
> suppose that doesn't literally mean k map/reduce jobs. Yes the broader
> idea was whether or not you might get something useful out of ALS
> earlier.
>
> On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > SVD need not be iterative at all. The SSVD code uses roughly 5
> map-reduces
> > to give you a high quality SVD approximation.  There is the option to add
> > 0, 1 or more extra iterations, but it is rare to need more than 1.
> >
> > ALS could well be of use after less work.  This is especially true for
> > incremental solutions.
>

Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
OK, the 'k iterations' happen inline in one job? I thought the Lanczos
algorithm found the k eigenvalues/vectors one after the other. Yeah I
suppose that doesn't literally mean k map/reduce jobs. Yes the broader
idea was whether or not you might get something useful out of ALS
earlier.

On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning <te...@gmail.com> wrote:
> SVD need not be iterative at all. The SSVD code uses roughly 5 map-reduces
> to give you a high quality SVD approximation.  There is the option to add
> 0, 1 or more extra iterations, but it is rare to need more than 1.
>
> ALS could well be of use after less work.  This is especially true for
> incremental solutions.

Re: Mathematical background of ALS recommenders

Posted by Ted Dunning <te...@gmail.com>.
Even more in-line.

On Mon, Mar 25, 2013 at 11:46 AM, Sean Owen <sr...@gmail.com> wrote:

> Points from across several e-mails --
>
> The initial item-feature matrix can be just random unit vectors too. I
> have slightly better results with that.
>
> You are finding the least-squares solution of A = U M' for U given A
> and M. Yes you can derive that analytically as the zero of the
> derivative of the error function.
>
> With m users and n items, and k features, where k=n, then I suppose
> you don't need any iterations at all since there is a trivial
> solution: U = A, M = I(n) (the nxn identity matrix). You would not
> find this on the first iteration, however, if you followed the
> algorithm, because you would be starting from some random starting
> point. But if you initialized M to the identity matrix, yes you'd find
> the exact solution immediately.


> Yes it is an iterative algorithm and you have to define a convergence
> criterion. I use average absolute difference in (U M') entries from
> one iteration to the next. (Well, a sample.) It's possible that you
> reach your criterion in 1 iteration, or, not. It depends on the
> criterion. Usually when you restart ALS on updated data, you use the
> previous M matrix as a starting point. If the change in data is small,
> one iteration should usually leave you still "converged" actually.
> But, from random starting point -- not typical.
>
> ALS is similar to SVD only in broad terms. The SVD is not always used
> to make a low-rank factorization, and, its outputs carry more
> guarantees -- they are orthonormal bases because it has factored out
> scaling factors into the third matrix Sigma. I think of ALS as more
> simplistic and therefore possibly faster.


Possibly.  Or not.  The stochastic projection method gives you an SVD pretty
quickly.


> WIth k features I believe
> (?) the SVD is necessarily a k-iteration process at least, whereas ALS
> might be of use after 1 iteration.


SVD need not be iterative at all. The SSVD code uses roughly 5 map-reduces
to give you a high quality SVD approximation.  There is the option to add
0, 1 or more extra iterations, but it is rare to need more than 1.

ALS could well be of use after less work.  This is especially true for
incremental solutions.


> The SVD is not a "shortcut" for
> ALS. If you go to the trouble of a full SVD, you can certainly use
> that factorization as-is though.
>
> You need regularization.
>

>
> It should be pointed out that the "ALS" often spoken of here is not
> one that factors the input matrix A. There's a variant, that I have
> had good results with, for 'implicit' feedback. There, you are
> actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
> values in A as weights in the loss function. You're reconstructing
> "interacted or not" and using input value as a confidence measure.
> This "works" for ratings although the interpretation in this variant
> doesn't line up with the nature of ratings. It works quite nicely for
> things like clicks, etc.
>
> (Much more can be said on this point.)
>
>
>
> On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <co...@dhuebner.com>
> wrote:
> > It's quite hard for me to get the mathematical concepts of the ALS
> > recommenders. It would be great if someone could help me to figure out
> > the details. This is my current status:
> >
> > 1. The item-feature (M) matrix is initialized using the average ratings
> > and random values (explicit case)
> >
> > 2. The user-feature (U) matrix is solved using the partial derivative of
> > the error function with respect to u_i (the columns of row-vectors of U)
> >
> > Supposed we use as many features as items are known and the error
> > function does not use any regularization. Would U be solved within the
> > first iteration? If not, I do not understand why more than one iteration
> > is needed.
> > Furthermore, I believe to have understood that using fewer features than
> > items and also applying regularization, does not allow to solve U in a
> > way that the stopping criterion can be met after only one iteration.
> > Thus, iteration is required to gradually converge to the stopping
> > criterion.
> >
> > I hope I have pointed out my problems clearly enough.
> >
>

Re: Mathematical background of ALS recommenders

Posted by Koobas <ko...@gmail.com>.
On Mon, Mar 25, 2013 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:

> If your input is clicks, carts, etc. yes you ought to get generally
> good results from something meant to consume implicit feedback, like
> ALS (for implicit feedback, yes there are at least two main variants).
> I think you are talking about the implicit version since you mention
> 0/1.
>
> lambda is the regularization parameter. It is defined a bit
> differently in the various papers though. Test a few values if you
> can.
> But you said "no weights in the regularization"... what do you mean?
> you don't want to disable regularization entirely.
>
I misspoke.
I meant lambda=1.


> On Mon, Mar 25, 2013 at 2:14 PM, Koobas <ko...@gmail.com> wrote:
> > On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> On Mon, Mar 25, 2013 at 1:41 PM, Koobas <ko...@gmail.com> wrote:
> >> >> But the assumption works nicely for click-like data. Better still
> when
> >> >> you can "weakly" prefer to reconstruct the 0 for missing observations
> >> >> and much more strongly prefer to reconstruct the "1" for observed
> >> >> data.
> >> >>
> >> >
> >> > This does seem intuitive.
> >> > How does the benefit manifest itself?
> >> > In lowering the RMSE of reconstructing the interaction matrix?
> >> > Are there any indicators that it results in better recommendations?
> >> > Koobas
> >>
> >> In this approach you are no longer reconstructing the interaction
> >> matrix, so there is no RMSE vs the interaction matrix. You're
> >> reconstructing a matrix of 0 and 1. Because entries are weighted
> >> differently, you're not even minimizing RMSE over that matrix -- the
> >> point is to take some errors more seriously than others. You're
> >> minimizing a *weighted* RMSE, yes.
> >>
> >> Yes of course the goal is better recommendations.  This broader idea
> >> is harder to measure. You can use mean average precision to measure
> >> the tendency to predict back interactions that were held out.
> >>
> >> Is it better? depends on better than *what*. Applying algorithms that
> >> treat input like ratings doesn't work as well on click-like data. The
> >> main problem is that these will tend to pay too much attention to
> >> large values. For example if an item was clicked 1000 times, and you
> >> are trying to actually reconstruct that "1000", then a 10% error
> >> "costs" (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
> >> item that was clicked once "costs" (0.1*1)^2 = 0.01. The former is
> >> considered a million times more important error-wise than the latter,
> >> even though the intuition is that it's just 1000 times more important.
> >>
> >> Better than algorithms that ignore the weight entirely -- yes probably
> >> if only because you are using more information. But as in all things
> >> "it depends".
> >>
> >
> > Let's say the following.
> > Classic market basket.
> > Implicit feedback.
> > Ones and zeros in the input matrix, no weights in the regularization,
> > lambda=1.
> > What I will get is:
> > A) a reasonable recommender,
> > B) a joke of a recommender.
>

Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
If your input is clicks, carts, etc. yes you ought to get generally
good results from something meant to consume implicit feedback, like
ALS (for implicit feedback, yes there are at least two main variants).
I think you are talking about the implicit version since you mention
0/1.

lambda is the regularization parameter. It is defined a bit
differently in the various papers though. Test a few values if you
can.
But you said "no weights in the regularization"... what do you mean?
You don't want to disable regularization entirely.

On Mon, Mar 25, 2013 at 2:14 PM, Koobas <ko...@gmail.com> wrote:
> On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> On Mon, Mar 25, 2013 at 1:41 PM, Koobas <ko...@gmail.com> wrote:
>> >> But the assumption works nicely for click-like data. Better still when
>> >> you can "weakly" prefer to reconstruct the 0 for missing observations
>> >> and much more strongly prefer to reconstruct the "1" for observed
>> >> data.
>> >>
>> >
>> > This does seem intuitive.
>> > How does the benefit manifest itself?
>> > In lowering the RMSE of reconstructing the interaction matrix?
>> > Are there any indicators that it results in better recommendations?
>> > Koobas
>>
>> In this approach you are no longer reconstructing the interaction
>> matrix, so there is no RMSE vs the interaction matrix. You're
>> reconstructing a matrix of 0 and 1. Because entries are weighted
>> differently, you're not even minimizing RMSE over that matrix -- the
>> point is to take some errors more seriously than others. You're
>> minimizing a *weighted* RMSE, yes.
>>
>> Yes of course the goal is better recommendations.  This broader idea
>> is harder to measure. You can use mean average precision to measure
>> the tendency to predict back interactions that were held out.
>>
>> Is it better? depends on better than *what*. Applying algorithms that
>> treat input like ratings doesn't work as well on click-like data. The
>> main problem is that these will tend to pay too much attention to
>> large values. For example if an item was clicked 1000 times, and you
>> are trying to actually reconstruct that "1000", then a 10% error
>> "costs" (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
>> item that was clicked once "costs" (0.1*1)^2 = 0.01. The former is
>> considered a million times more important error-wise than the latter,
>> even though the intuition is that it's just 1000 times more important.
>>
>> Better than algorithms that ignore the weight entirely -- yes probably
>> if only because you are using more information. But as in all things
>> "it depends".
>>
>
> Let's say the following.
> Classic market basket.
> Implicit feedback.
> Ones and zeros in the input matrix, no weights in the regularization,
> lambda=1.
> What I will get is:
> A) a reasonable recommender,
> B) a joke of a recommender.

Re: Mathematical background of ALS recommenders

Posted by Koobas <ko...@gmail.com>.
On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen <sr...@gmail.com> wrote:

> On Mon, Mar 25, 2013 at 1:41 PM, Koobas <ko...@gmail.com> wrote:
> >> But the assumption works nicely for click-like data. Better still when
> >> you can "weakly" prefer to reconstruct the 0 for missing observations
> >> and much more strongly prefer to reconstruct the "1" for observed
> >> data.
> >>
> >
> > This does seem intuitive.
> > How does the benefit manifest itself?
> > In lowering the RMSE of reconstructing the interaction matrix?
> > Are there any indicators that it results in better recommendations?
> > Koobas
>
> In this approach you are no longer reconstructing the interaction
> matrix, so there is no RMSE vs the interaction matrix. You're
> reconstructing a matrix of 0 and 1. Because entries are weighted
> differently, you're not even minimizing RMSE over that matrix -- the
> point is to take some errors more seriously than others. You're
> minimizing a *weighted* RMSE, yes.
>
> Yes of course the goal is better recommendations.  This broader idea
> is harder to measure. You can use mean average precision to measure
> the tendency to predict back interactions that were held out.
>
> Is it better? depends on better than *what*. Applying algorithms that
> treat input like ratings doesn't work as well on click-like data. The
> main problem is that these will tend to pay too much attention to
> large values. For example if an item was clicked 1000 times, and you
> are trying to actually reconstruct that "1000", then a 10% error
> "costs" (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
> item that was clicked once "costs" (0.1*1)^2 = 0.01. The former is
> considered a million times more important error-wise than the latter,
> even though the intuition is that it's just 1000 times more important.
>
> Better than algorithms that ignore the weight entirely -- yes probably
> if only because you are using more information. But as in all things
> "it depends".
>

Let's say the following.
Classic market basket.
Implicit feedback.
Ones and zeros in the input matrix, no weights in the regularization,
lambda=1.
What I will get is:
A) a reasonable recommender,
B) a joke of a recommender.

Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Mar 25, 2013 at 1:41 PM, Koobas <ko...@gmail.com> wrote:
>> But the assumption works nicely for click-like data. Better still when
>> you can "weakly" prefer to reconstruct the 0 for missing observations
>> and much more strongly prefer to reconstruct the "1" for observed
>> data.
>>
>
> This does seem intuitive.
> How does the benefit manifest itself?
> In lowering the RMSE of reconstructing the interaction matrix?
> Are there any indicators that it results in better recommendations?
> Koobas

In this approach you are no longer reconstructing the interaction
matrix, so there is no RMSE vs the interaction matrix. You're
reconstructing a matrix of 0 and 1. Because entries are weighted
differently, you're not even minimizing RMSE over that matrix -- the
point is to take some errors more seriously than others. You're
minimizing a *weighted* RMSE, yes.

Yes of course the goal is better recommendations.  This broader idea
is harder to measure. You can use mean average precision to measure
the tendency to predict back interactions that were held out.

Is it better? depends on better than *what*. Applying algorithms that
treat input like ratings doesn't work as well on click-like data. The
main problem is that these will tend to pay too much attention to
large values. For example if an item was clicked 1000 times, and you
are trying to actually reconstruct that "1000", then a 10% error
"costs" (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
item that was clicked once "costs" (0.1*1)^2 = 0.01. The former is
considered a million times more important error-wise than the latter,
even though the intuition is that it's just 1000 times more important.

Better than algorithms that ignore the weight entirely -- yes probably
if only because you are using more information. But as in all things
"it depends".

Re: Mathematical background of ALS recommenders

Posted by Koobas <ko...@gmail.com>.
On Mon, Mar 25, 2013 at 7:48 AM, Sean Owen <sr...@gmail.com> wrote:

> On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
> > Well in LSI it is ok to do that, as a missing entry means that the
> > document contains zero occurrences of a given term which is totally fine.
> >
> > In Collaborative Filtering with explicit feedback, a missing rating is
> > not automatically a rating of zero, it is simply unknown what the user
> > would give as rating.
> >
> > fOR implicit data (number of interactions), a missing entry is indeed
> > zero, but in most cases you might not have the same confidence in that
> > observation as if you observed an interaction. Koren's ALS paper
> > discusses this and introduces constructs to handle this, by putting more
> > weight on minimizing the loss over observed interactions.
> >
> > In matrix factorization for CF, the factorization usually has to
> > minimize the regularized loss over the known entries only. If all
> > unknown entries were simply considered zero, I'd assume that the
> > factorization that resulted would not generalize very well to unseen
> > data. Some researchers title matrix factorization for CF as matrix
> > completion which IMHO better describes the problem.
>
> Yes it's just that you "shouldn't" if inputs are rating-like, not that
> you literally couldn't. If your input is ratings on a scale of 1-5
> then reconstructing a 0 everywhere else means you assume everything
> not viewed is hated, which doesn't work at all. You can subtract the
> mean from observed ratings, and then you assume everything unobserved
> has an average rating.
>
Thank you so much for this comment.
This is the most intuitive way of approaching it.
Is there really more to it?
(than subtracting the mean and treating zeros as neutral)


> But the assumption works nicely for click-like data. Better still when
> you can "weakly" prefer to reconstruct the 0 for missing observations
> and much more strongly prefer to reconstruct the "1" for observed
> data.
>

This does seem intuitive.
How does the benefit manifest itself?
In lowering the RMSE of reconstructing the interaction matrix?
Are there any indicators that it results in better recommendations?
Koobas

Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter <ss...@apache.org> wrote:
> Well in LSI it is ok to do that, as a missing entry means that the
> document contains zero occurrences of a given term which is totally fine.
>
> In Collaborative Filtering with explicit feedback, a missing rating is
> not automatically a rating of zero, it is simply unknown what the user
> would give as rating.
>
> fOR implicit data (number of interactions), a missing entry is indeed
> zero, but in most cases you might not have the same confidence in that
> observation as if you observed an interaction. Koren's ALS paper
> discusses this and introduces constructs to handle this, by putting more
> weight on minimizing the loss over observed interactions.
>
> In matrix factorization for CF, the factorization usually has to
> minimize the regularized loss over the known entries only. If all
> unknown entries were simply considered zero, I'd assume that the
> factorization that resulted would not generalize very well to unseen
> data. Some researchers title matrix factorization for CF as matrix
> completion which IMHO better describes the problem.

Yes it's just that you "shouldn't" if inputs are rating-like, not that
you literally couldn't. If your input is ratings on a scale of 1-5
then reconstructing a 0 everywhere else means you assume everything
not viewed is hated, which doesn't work at all. You can subtract the
mean from observed ratings, and then you assume everything unobserved
has an average rating.

But the assumption works nicely for click-like data. Better still when
you can "weakly" prefer to reconstruct the 0 for missing observations
and much more strongly prefer to reconstruct the "1" for observed
data.
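
(A tiny sketch of the mean-centering idea, here per user as one possible choice, with
made-up numbers:)

import numpy as np

R = np.array([[5., 0., 3., 0.],                   # 0 marks an unobserved rating
              [0., 2., 0., 4.]])
observed = R != 0
user_mean = np.nanmean(np.where(observed, R, np.nan), axis=1)  # mean of observed ratings
R_centered = np.where(observed, R - user_mean[:, None], 0.0)
# A reconstructed 0 in a missing cell now means "about average", not "hated".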

Re: Mathematical background of ALS recommenders

Posted by Sebastian Schelter <ss...@googlemail.com>.
As clarification, here are the relevant papers. The approach for
explicit feedback [1] does not use unobserved cells; only the approach
for handling implicit feedback [2] does, but weighs them down.

/s

[1] "Large-scale Parallel Collaborative Filtering for the Netflix Prize"
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf

[2] "Collaborative Filtering for Implicit Feedback Datasets"
http://research.yahoo.com/pub/2433
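
(For reference, the two objectives are roughly the following, in the notation of those
papers: I is the set of observed cells, n_{u_i} and n_{m_j} count the ratings of user i
and item j, p_{ij} is 1 if user i interacted with item j and 0 otherwise, and
c_{ij} = 1 + alpha * r_{ij} is the confidence.)

% Explicit feedback, ALS-WR [1]: loss over the observed cells only,
% with weighted-lambda regularization
\min_{U,M} \sum_{(i,j) \in I} (r_{ij} - u_i^{\top} m_j)^2
  + \lambda \Big( \sum_i n_{u_i} \lVert u_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \Big)

% Implicit feedback [2]: loss over all cells, reconstructing p_{ij},
% with the confidence c_{ij} down-weighting the unobserved cells
\min_{X,Y} \sum_{i,j} c_{ij} (p_{ij} - x_i^{\top} y_j)^2
  + \lambda \Big( \sum_i \lVert x_i \rVert^2 + \sum_j \lVert y_j \rVert^2 \Big)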


On 25.03.2013 14:51, Dmitriy Lyubimov wrote:
> On Mar 25, 2013 6:44 AM, "Sean Owen" <sr...@gmail.com> wrote:
>>
>> (The unobserved entries are still in the loss function, just with low
>> weight. They are also in the system of equations you are solving for.)
> 
> Not in the classic alswr paper i was specifically referring to. It actually
> uses minors of observations with unobserved rows or columns  thrown out.
> The solution you are often referring to, the implicit feedback, indeed does
> not since it is using a different observation encoding technique.
> 
>>
>> On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>>> Classic als wr is bypassing underlearning problem by cutting out unrated
>>> entries from linear equations system. It also still has a well-defined
>>> regularization technique which allows to find optimal fit in theory (but
>>> still not in mahout, not without at least some additional sweat, i
> heard).
> 


Re: Mathematical background of ALS recommenders

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mar 25, 2013 6:44 AM, "Sean Owen" <sr...@gmail.com> wrote:
>
> (The unobserved entries are still in the loss function, just with low
> weight. They are also in the system of equations you are solving for.)

Not in the classic ALS-WR paper I was specifically referring to. It actually
uses minors of the observations, with unobserved rows or columns thrown out.
The solution you are often referring to, the implicit feedback one, indeed does
not, since it uses a different observation encoding technique.

>
> On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:
> > Classic als wr is bypassing underlearning problem by cutting out unrated
> > entries from linear equations system. It also still has a well-defined
> > regularization technique which allows to find optimal fit in theory (but
> > still not in mahout, not without at least some additional sweat, i
heard).

Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
(The unobserved entries are still in the loss function, just with low
weight. They are also in the system of equations you are solving for.)

On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Classic als wr is bypassing underlearning problem by cutting out unrated
> entries from linear equations system. It also still has a well-defined
> regularization technique which allows to find optimal fit in theory (but
> still not in mahout, not without at least some additional sweat, i heard).

Re: Mathematical background of ALS recommenders

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mar 25, 2013 6:38 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>
>
> On Mar 25, 2013 4:15 AM, "Ted Dunning" <te...@gmail.com> wrote:
> >
> > Well, actually, you can.
> >
> > LSI does exactly that.
> >
> > What the effect is of doing this is not clear to me.  Do you know what
> > happens if you assume missing values are 0?
>
> I thought about that. My understanding is that with LSI putting a zero is
not a lack of observation, it is actually an observation that the term has
not been used in the document, which marks it as irrelevant to the
document.
Or vice versa, it is used everywhere and the value of the observation is
zero.
>
> With CF, if you put a zero (or, rather, an average of the entries in the
observation vector) it is too severe a patch for a lack of observation.
>
> Typically such patching has terrible effects in practice and has an
effect similar to a lack of learning rate or training, or an over-regularized
result.
>
> Indeed, let's assume my average movie rating is 3. And I am into action
movies, so perhaps I rated all good action movies as 5. But I only watched
100 action movies and Netflix has 10000 such similar movies, so the matrix
input is filled to assume that I have a rating of 5 for 100 and 3 for 9900.
That is, it still gives me barely a nudge when we try to solve a sparse
problem with SSVD.
>
> Classic ALS-WR bypasses the underlearning problem by cutting out unrated
entries from the linear equations system. It also has a well-defined
regularization technique which allows finding an optimal fit in theory (but
still not in Mahout, not without at least some additional sweat, I heard).
>
> D
>
> >
> > On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <ss...@apache.org>
wrote:
> >
> > > I think one crucial point is missing from this discussion. In
> > > Collaborative Filtering the matrix to decompose is only partially
> > > defined. A missing entry cannot simply be substituted by 0, it is
> > > unknown. Therefore, you cannot use standard SVD techniques as you
would
> > > use for LSI for example.
> > >
> > > On 25.03.2013 11:46, Sean Owen wrote:
> > > > Points from across several e-mails --
> > > >
> > > > The initial item-feature matrix can be just random unit vectors
too. I
> > > > have slightly better results with that.
> > > >
> > > > You are finding the least-squares solution of A = U M' for U given A
> > > > and M. Yes you can derive that analytically as the zero of the
> > > > derivative of the error function.
> > > >
> > > > With m users and n items, and k features, where k=n, then I suppose
> > > > you don't need any iterations at all since there is a trivial
> > > > solution: U = A, M = I(n) (the nxn identity matrix). You would not
> > > > find this on the first iteration, however, if you followed the
> > > > algorithm, because you would be starting from some random starting
> > > > point. But if you initialized M to the identity matrix, yes you'd
find
> > > > the exact solution immediately.
> > > >
> > > > Yes it is an iterative algorithm and you have to define a
convergence
> > > > criterion. I use average absolute difference in (U M') entries from
> > > > one iteration to the next. (Well, a sample.) It's possible that you
> > > > reach your criterion in 1 iteration, or, not. It depends on the
> > > > criterion. Usually when you restart ALS on updated data, you use the
> > > > previous M matrix as a starting point. If the change in data is
small,
> > > > one iteration should usually leave you still "converged" actually.
> > > > But, from random starting point -- not typical.
> > > >
> > > > ALS is similar to SVD only in broad terms. The SVD is not always
used
> > > > to make a low-rank factorization, and, its outputs carry more
> > > > guarantees -- they are orthonormal bases because it has factored out
> > > > scaling factors into the third matrix Sigma. I think of ALS as more
> > > > simplistic and therefore possibly faster. WIth k features I believe
> > > > (?) the SVD is necessarily a k-iteration process at least, whereas
ALS
> > > > might be of use after 1 iteration. The SVD is not a "shortcut" for
> > > > ALS. If you go to the trouble of a full SVD, you can certainly use
> > > > that factorization as-is though.
> > > >
> > > > You need regularization.
> > > >
> > > >
> > > > It should be pointed out that the "ALS" often spoken of here is not
> > > > one that factors the input matrix A. There's a variant, that I have
> > > > had good results with, for 'implicit' feedback. There, you are
> > > > actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and
using
> > > > values in A as weights in the loss function. You're reconstructing
> > > > "interacted or not" and using input value as a confidence measure.
> > > > This "works" for ratings although the interpretation in this variant
> > > > doesn't line up with the nature of ratings. It works quite nicely
for
> > > > things like clicks, etc.
> > > >
> > > > (Much more can be said on this point.)
> > > >
> > > >
> > > >
> > > > On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <
contact@dhuebner.com>
> > > wrote:
> > > >> It's quite hard for me to get the mathematical concepts of the ALS
> > > >> recommenders. It would be great if someone could help me to figure
out
> > > >> the details. This is my current status:
> > > >>
> > > >> 1. The item-feature (M) matrix is initialized using the average
ratings
> > > >> and random values (explicit case)
> > > >>
> > > >> 2. The user-feature (U) matrix is solved using the partial
derivative of
> > > >> the error function with respect to u_i (the columns of row-vectors
of U)
> > > >>
> > > >> Supposed we use as many features as items are known and the error
> > > >> function does not use any regularization. Would U be solved within
the
> > > >> first iteration? If not, I do not understand why more than one
iteration
> > > >> is needed.
> > > >> Furthermore, I believe to have understood that using fewer
features than
> > > >> items and also applying regularization, does not allow to solve U
in a
> > > >> way that the stopping criterion can be met after only one
iteration.
> > > >> Thus, iteration is required to gradually converge to the stopping
> > > >> criterion.
> > > >>
> > > >> I hope I have pointed out my problems clearly enough.
> > > >>
> > >
> > >

Re: Mathematical background of ALS recommenders

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mar 25, 2013 4:15 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
> Well, actually, you can.
>
> LSI does exactly that.
>
> What the effect is of doing this is not clear to me.  Do you know what
> happens if you assume missing values are 0?

I thought about that. My understanding is that with LSI putting a zero is
not a lack of observation; it is actually an observation that the term has
not been used in the document, which marks it as irrelevant to the
document.

With CF, if you put a zero (or, rather, an average of the entries in the
observation vector) it is too severe a patch for a lack of observation.

Typically such patching has terrible effects in practice and has an effect
similar to a lack of learning rate or training, or an over-regularized result.

Indeed, let's assume my average movie rating is 3. And I am into action
movies, so perhaps I rated all good action movies as 5. But I only watched
100 action movies and Netflix has 10000 such similar movies, so the matrix
input is filled to assume that I have a rating of 5 for 100 and 3 for 9900.
That is, it still gives me barely a nudge when we try to solve a sparse
problem with SSVD.

Classic ALS-WR bypasses the underlearning problem by cutting out unrated
entries from the linear equations system. It also has a well-defined
regularization technique which allows finding an optimal fit in theory (but
still not in Mahout, not without at least some additional sweat, I heard).

D

>
> On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <ss...@apache.org>
wrote:
>
> > I think one crucial point is missing from this discussion. In
> > Collaborative Filtering the matrix to decompose is only partially
> > defined. A missing entry cannot simply be substituted by 0, it is
> > unknown. Therefore, you cannot use standard SVD techniques as you would
> > use for LSI for example.
> >
> > On 25.03.2013 11:46, Sean Owen wrote:
> > > Points from across several e-mails --
> > >
> > > The initial item-feature matrix can be just random unit vectors too. I
> > > have slightly better results with that.
> > >
> > > You are finding the least-squares solution of A = U M' for U given A
> > > and M. Yes you can derive that analytically as the zero of the
> > > derivative of the error function.
> > >
> > > With m users and n items, and k features, where k=n, then I suppose
> > > you don't need any iterations at all since there is a trivial
> > > solution: U = A, M = I(n) (the nxn identity matrix). You would not
> > > find this on the first iteration, however, if you followed the
> > > algorithm, because you would be starting from some random starting
> > > point. But if you initialized M to the identity matrix, yes you'd find
> > > the exact solution immediately.
> > >
> > > Yes it is an iterative algorithm and you have to define a convergence
> > > criterion. I use average absolute difference in (U M') entries from
> > > one iteration to the next. (Well, a sample.) It's possible that you
> > > reach your criterion in 1 iteration, or, not. It depends on the
> > > criterion. Usually when you restart ALS on updated data, you use the
> > > previous M matrix as a starting point. If the change in data is small,
> > > one iteration should usually leave you still "converged" actually.
> > > But, from random starting point -- not typical.
> > >
> > > ALS is similar to SVD only in broad terms. The SVD is not always used
> > > to make a low-rank factorization, and, its outputs carry more
> > > guarantees -- they are orthonormal bases because it has factored out
> > > scaling factors into the third matrix Sigma. I think of ALS as more
> > > simplistic and therefore possibly faster. WIth k features I believe
> > > (?) the SVD is necessarily a k-iteration process at least, whereas ALS
> > > might be of use after 1 iteration. The SVD is not a "shortcut" for
> > > ALS. If you go to the trouble of a full SVD, you can certainly use
> > > that factorization as-is though.
> > >
> > > You need regularization.
> > >
> > >
> > > It should be pointed out that the "ALS" often spoken of here is not
> > > one that factors the input matrix A. There's a variant, that I have
> > > had good results with, for 'implicit' feedback. There, you are
> > > actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
> > > values in A as weights in the loss function. You're reconstructing
> > > "interacted or not" and using input value as a confidence measure.
> > > This "works" for ratings although the interpretation in this variant
> > > doesn't line up with the nature of ratings. It works quite nicely for
> > > things like clicks, etc.
> > >
> > > (Much more can be said on this point.)
> > >
> > >
> > >
> > > On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <contact@dhuebner.com
>
> > wrote:
> > >> It's quite hard for me to get the mathematical concepts of the ALS
> > >> recommenders. It would be great if someone could help me to figure
out
> > >> the details. This is my current status:
> > >>
> > >> 1. The item-feature (M) matrix is initialized using the average
ratings
> > >> and random values (explicit case)
> > >>
> > >> 2. The user-feature (U) matrix is solved using the partial
derivative of
> > >> the error function with respect to u_i (the columns of row-vectors
of U)
> > >>
> > >> Supposed we use as many features as items are known and the error
> > >> function does not use any regularization. Would U be solved within
the
> > >> first iteration? If not, I do not understand why more than one
iteration
> > >> is needed.
> > >> Furthermore, I believe to have understood that using fewer features
than
> > >> items and also applying regularization, does not allow to solve U in
a
> > >> way that the stopping criterion can be met after only one iteration.
> > >> Thus, iteration is required to gradually converge to the stopping
> > >> criterion.
> > >>
> > >> I hope I have pointed out my problems clearly enough.
> > >>
> >
> >

Re: Mathematical background of ALS recommenders

Posted by Abhijith CHandraprabhu <ab...@gmail.com>.
The inability of SVD to handle missing ratings in the case of an explicit
rating scheme was the reason I had to use ALS. In my case I am using MATLAB,
which uses Lanczos tridiagonalization for SSVD, but the point is that SSVD
cannot work with missing entries, so it automatically treats them as zeros,
and it does not give the expected results. I guess this answers Ted's
question, i.e., LSI with missing values as 0s. I can share the results if needed.



On Mon, Mar 25, 2013 at 12:25 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Well in LSI it is ok to do that, as a missing entry means that the
> document contains zero occurrences of a given term which is totally fine.
>
> In Collaborative Filtering with explicit feedback, a missing rating is
> not automatically a rating of zero, it is simply unknown what the user
> would give as rating.
>
> fOR implicit data (number of interactions), a missing entry is indeed
> zero, but in most cases you might not have the same confidence in that
> observation as if you observed an interaction. Koren's ALS paper
> discusses this and introduces constructs to handle this, by putting more
> weight on minimizing the loss over observed interactions.
>
> In matrix factorization for CF, the factorization usually has to
> minimize the regularized loss over the known entries only. If all
> unknown entries were simply considered zero, I'd assume that the
> factorization that resulted would not generalize very well to unseen
> data. Some researchers title matrix factorization for CF as matrix
> completion which IMHO better describes the problem.
>
> On 25.03.2013 12:14, Ted Dunning wrote:
> > Well, actually, you can.
> >
> > LSI does exactly that.
> >
> > What the effect is of doing this is not clear to me.  Do you know what
> > happens if you assume missing values are 0?
> >
> > On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >
> >> I think one crucial point is missing from this discussion. In
> >> Collaborative Filtering the matrix to decompose is only partially
> >> defined. A missing entry cannot simply be substituted by 0, it is
> >> unknown. Therefore, you cannot use standard SVD techniques as you would
> >> use for LSI for example.
> >>
> >> On 25.03.2013 11:46, Sean Owen wrote:
> >>> Points from across several e-mails --
> >>>
> >>> The initial item-feature matrix can be just random unit vectors too. I
> >>> have slightly better results with that.
> >>>
> >>> You are finding the least-squares solution of A = U M' for U given A
> >>> and M. Yes you can derive that analytically as the zero of the
> >>> derivative of the error function.
> >>>
> >>> With m users and n items, and k features, where k=n, then I suppose
> >>> you don't need any iterations at all since there is a trivial
> >>> solution: U = A, M = I(n) (the nxn identity matrix). You would not
> >>> find this on the first iteration, however, if you followed the
> >>> algorithm, because you would be starting from some random starting
> >>> point. But if you initialized M to the identity matrix, yes you'd find
> >>> the exact solution immediately.
> >>>
> >>> Yes it is an iterative algorithm and you have to define a convergence
> >>> criterion. I use average absolute difference in (U M') entries from
> >>> one iteration to the next. (Well, a sample.) It's possible that you
> >>> reach your criterion in 1 iteration, or, not. It depends on the
> >>> criterion. Usually when you restart ALS on updated data, you use the
> >>> previous M matrix as a starting point. If the change in data is small,
> >>> one iteration should usually leave you still "converged" actually.
> >>> But, from random starting point -- not typical.
> >>>
> >>> ALS is similar to SVD only in broad terms. The SVD is not always used
> >>> to make a low-rank factorization, and, its outputs carry more
> >>> guarantees -- they are orthonormal bases because it has factored out
> >>> scaling factors into the third matrix Sigma. I think of ALS as more
> >>> simplistic and therefore possibly faster. WIth k features I believe
> >>> (?) the SVD is necessarily a k-iteration process at least, whereas ALS
> >>> might be of use after 1 iteration. The SVD is not a "shortcut" for
> >>> ALS. If you go to the trouble of a full SVD, you can certainly use
> >>> that factorization as-is though.
> >>>
> >>> You need regularization.
> >>>
> >>>
> >>> It should be pointed out that the "ALS" often spoken of here is not
> >>> one that factors the input matrix A. There's a variant, that I have
> >>> had good results with, for 'implicit' feedback. There, you are
> >>> actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
> >>> values in A as weights in the loss function. You're reconstructing
> >>> "interacted or not" and using input value as a confidence measure.
> >>> This "works" for ratings although the interpretation in this variant
> >>> doesn't line up with the nature of ratings. It works quite nicely for
> >>> things like clicks, etc.
> >>>
> >>> (Much more can be said on this point.)
> >>>
> >>>
> >>>
> >>> On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <contact@dhuebner.com
> >
> >> wrote:
> >>>> It's quite hard for me to get the mathematical concepts of the ALS
> >>>> recommenders. It would be great if someone could help me to figure out
> >>>> the details. This is my current status:
> >>>>
> >>>> 1. The item-feature (M) matrix is initialized using the average
> ratings
> >>>> and random values (explicit case)
> >>>>
> >>>> 2. The user-feature (U) matrix is solved using the partial derivative
> of
> >>>> the error function with respect to u_i (the columns of row-vectors of
> U)
> >>>>
> >>>> Supposed we use as many features as items are known and the error
> >>>> function does not use any regularization. Would U be solved within the
> >>>> first iteration? If not, I do not understand why more than one
> iteration
> >>>> is needed.
> >>>> Furthermore, I believe to have understood that using fewer features
> than
> >>>> items and also applying regularization, does not allow to solve U in a
> >>>> way that the stopping criterion can be met after only one iteration.
> >>>> Thus, iteration is required to gradually converge to the stopping
> >>>> criterion.
> >>>>
> >>>> I hope I have pointed out my problems clearly enough.
> >>>>
> >>
> >>
> >
>
>


-- 
Best regards,
Abhijith

Re: Mathematical background of ALS recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
Well in LSI it is ok to do that, as a missing entry means that the
document contains zero occurrences of a given term which is totally fine.

In Collaborative Filtering with explicit feedback, a missing rating is
not automatically a rating of zero, it is simply unknown what the user
would give as rating.

For implicit data (number of interactions), a missing entry is indeed
zero, but in most cases you might not have the same confidence in that
observation as if you observed an interaction. Koren's ALS paper
discusses this and introduces constructs to handle this, by putting more
weight on minimizing the loss over observed interactions.

In matrix factorization for CF, the factorization usually has to
minimize the regularized loss over the known entries only. If all
unknown entries were simply considered zero, I'd assume that the
factorization that resulted would not generalize very well to unseen
data. Some researchers title matrix factorization for CF as matrix
completion which IMHO better describes the problem.

On 25.03.2013 12:14, Ted Dunning wrote:
> Well, actually, you can.
> 
> LSI does exactly that.
> 
> What the effect is of doing this is not clear to me.  Do you know what
> happens if you assume missing values are 0?
> 
> On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> I think one crucial point is missing from this discussion. In
>> Collaborative Filtering the matrix to decompose is only partially
>> defined. A missing entry cannot simply be substituted by 0, it is
>> unknown. Therefore, you cannot use standard SVD techniques as you would
>> use for LSI for example.
>>
>> On 25.03.2013 11:46, Sean Owen wrote:
>>> Points from across several e-mails --
>>>
>>> The initial item-feature matrix can be just random unit vectors too. I
>>> have slightly better results with that.
>>>
>>> You are finding the least-squares solution of A = U M' for U given A
>>> and M. Yes you can derive that analytically as the zero of the
>>> derivative of the error function.
>>>
>>> With m users and n items, and k features, where k=n, then I suppose
>>> you don't need any iterations at all since there is a trivial
>>> solution: U = A, M = I(n) (the nxn identity matrix). You would not
>>> find this on the first iteration, however, if you followed the
>>> algorithm, because you would be starting from some random starting
>>> point. But if you initialized M to the identity matrix, yes you'd find
>>> the exact solution immediately.
>>>
>>> Yes it is an iterative algorithm and you have to define a convergence
>>> criterion. I use average absolute difference in (U M') entries from
>>> one iteration to the next. (Well, a sample.) It's possible that you
>>> reach your criterion in 1 iteration, or, not. It depends on the
>>> criterion. Usually when you restart ALS on updated data, you use the
>>> previous M matrix as a starting point. If the change in data is small,
>>> one iteration should usually leave you still "converged" actually.
>>> But, from random starting point -- not typical.
>>>
>>> ALS is similar to SVD only in broad terms. The SVD is not always used
>>> to make a low-rank factorization, and, its outputs carry more
>>> guarantees -- they are orthonormal bases because it has factored out
>>> scaling factors into the third matrix Sigma. I think of ALS as more
> >>> simplistic and therefore possibly faster. With k features I believe
>>> (?) the SVD is necessarily a k-iteration process at least, whereas ALS
>>> might be of use after 1 iteration. The SVD is not a "shortcut" for
>>> ALS. If you go to the trouble of a full SVD, you can certainly use
>>> that factorization as-is though.
>>>
>>> You need regularization.
>>>
>>>
>>> It should be pointed out that the "ALS" often spoken of here is not
>>> one that factors the input matrix A. There's a variant, that I have
>>> had good results with, for 'implicit' feedback. There, you are
>>> actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
>>> values in A as weights in the loss function. You're reconstructing
>>> "interacted or not" and using input value as a confidence measure.
>>> This "works" for ratings although the interpretation in this variant
>>> doesn't line up with the nature of ratings. It works quite nicely for
>>> things like clicks, etc.
>>>
>>> (Much more can be said on this point.)
>>>
>>>
>>>
> >>> On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <co...@dhuebner.com> wrote:
>>>> It's quite hard for me to get the mathematical concepts of the ALS
>>>> recommenders. It would be great if someone could help me to figure out
>>>> the details. This is my current status:
>>>>
>>>> 1. The item-feature (M) matrix is initialized using the average ratings
>>>> and random values (explicit case)
>>>>
>>>> 2. The user-feature (U) matrix is solved using the partial derivative of
>>>> the error function with respect to u_i (the columns of row-vectors of U)
>>>>
>>>> Supposed we use as many features as items are known and the error
>>>> function does not use any regularization. Would U be solved within the
>>>> first iteration? If not, I do not understand why more than one iteration
>>>> is needed.
>>>> Furthermore, I believe to have understood that using fewer features than
>>>> items and also applying regularization, does not allow to solve U in a
>>>> way that the stopping criterion can be met after only one iteration.
>>>> Thus, iteration is required to gradually converge to the stopping
>>>> criterion.
>>>>
>>>> I hope I have pointed out my problems clearly enough.
>>>>
>>
>>
> 


Re: Mathematical background of ALS recommenders

Posted by Ted Dunning <te...@gmail.com>.
Well, actually, you can.

LSI does exactly that.

The effect of doing this is not clear to me. Do you know what
happens if you assume missing values are 0?

On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I think one crucial point is missing from this discussion. In
> Collaborative Filtering the matrix to decompose is only partially
> defined. A missing entry cannot simply be substituted by 0, it is
> unknown. Therefore, you cannot use standard SVD techniques as you would
> use for LSI for example.
>
> On 25.03.2013 11:46, Sean Owen wrote:
> > Points from across several e-mails --
> >
> > The initial item-feature matrix can be just random unit vectors too. I
> > have slightly better results with that.
> >
> > You are finding the least-squares solution of A = U M' for U given A
> > and M. Yes you can derive that analytically as the zero of the
> > derivative of the error function.
> >
> > With m users and n items, and k features, where k=n, then I suppose
> > you don't need any iterations at all since there is a trivial
> > solution: U = A, M = I(n) (the nxn identity matrix). You would not
> > find this on the first iteration, however, if you followed the
> > algorithm, because you would be starting from some random starting
> > point. But if you initialized M to the identity matrix, yes you'd find
> > the exact solution immediately.
> >
> > Yes it is an iterative algorithm and you have to define a convergence
> > criterion. I use average absolute difference in (U M') entries from
> > one iteration to the next. (Well, a sample.) It's possible that you
> > reach your criterion in 1 iteration, or, not. It depends on the
> > criterion. Usually when you restart ALS on updated data, you use the
> > previous M matrix as a starting point. If the change in data is small,
> > one iteration should usually leave you still "converged" actually.
> > But, from random starting point -- not typical.
> >
> > ALS is similar to SVD only in broad terms. The SVD is not always used
> > to make a low-rank factorization, and, its outputs carry more
> > guarantees -- they are orthonormal bases because it has factored out
> > scaling factors into the third matrix Sigma. I think of ALS as more
> > simplistic and therefore possibly faster. With k features I believe
> > (?) the SVD is necessarily a k-iteration process at least, whereas ALS
> > might be of use after 1 iteration. The SVD is not a "shortcut" for
> > ALS. If you go to the trouble of a full SVD, you can certainly use
> > that factorization as-is though.
> >
> > You need regularization.
> >
> >
> > It should be pointed out that the "ALS" often spoken of here is not
> > one that factors the input matrix A. There's a variant, that I have
> > had good results with, for 'implicit' feedback. There, you are
> > actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
> > values in A as weights in the loss function. You're reconstructing
> > "interacted or not" and using input value as a confidence measure.
> > This "works" for ratings although the interpretation in this variant
> > doesn't line up with the nature of ratings. It works quite nicely for
> > things like clicks, etc.
> >
> > (Much more can be said on this point.)
> >
> >
> >
> > On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <co...@dhuebner.com> wrote:
> >> It's quite hard for me to get the mathematical concepts of the ALS
> >> recommenders. It would be great if someone could help me to figure out
> >> the details. This is my current status:
> >>
> >> 1. The item-feature (M) matrix is initialized using the average ratings
> >> and random values (explicit case)
> >>
> >> 2. The user-feature (U) matrix is solved using the partial derivative of
> >> the error function with respect to u_i (the columns of row-vectors of U)
> >>
> >> Supposed we use as many features as items are known and the error
> >> function does not use any regularization. Would U be solved within the
> >> first iteration? If not, I do not understand why more than one iteration
> >> is needed.
> >> Furthermore, I believe to have understood that using fewer features than
> >> items and also applying regularization, does not allow to solve U in a
> >> way that the stopping criterion can be met after only one iteration.
> >> Thus, iteration is required to gradually converge to the stopping
> >> criterion.
> >>
> >> I hope I have pointed out my problems clearly enough.
> >>
>
>

Re: Mathematical background of ALS recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
I think one crucial point is missing from this discussion. In
Collaborative Filtering the matrix to decompose is only partially
defined. A missing entry cannot simply be substituted by 0, it is
unknown. Therefore, you cannot use standard SVD techniques as you would
use for LSI for example.

On 25.03.2013 11:46, Sean Owen wrote:
> Points from across several e-mails --
> 
> The initial item-feature matrix can be just random unit vectors too. I
> have slightly better results with that.
> 
> You are finding the least-squares solution of A = U M' for U given A
> and M. Yes you can derive that analytically as the zero of the
> derivative of the error function.
> 
> With m users and n items, and k features, where k=n, then I suppose
> you don't need any iterations at all since there is a trivial
> solution: U = A, M = I(n) (the nxn identity matrix). You would not
> find this on the first iteration, however, if you followed the
> algorithm, because you would be starting from some random starting
> point. But if you initialized M to the identity matrix, yes you'd find
> the exact solution immediately.
> 
> Yes it is an iterative algorithm and you have to define a convergence
> criterion. I use average absolute difference in (U M') entries from
> one iteration to the next. (Well, a sample.) It's possible that you
> reach your criterion in 1 iteration, or, not. It depends on the
> criterion. Usually when you restart ALS on updated data, you use the
> previous M matrix as a starting point. If the change in data is small,
> one iteration should usually leave you still "converged" actually.
> But, from random starting point -- not typical.
> 
> ALS is similar to SVD only in broad terms. The SVD is not always used
> to make a low-rank factorization, and, its outputs carry more
> guarantees -- they are orthonormal bases because it has factored out
> scaling factors into the third matrix Sigma. I think of ALS as more
> simplistic and therefore possibly faster. With k features I believe
> (?) the SVD is necessarily a k-iteration process at least, whereas ALS
> might be of use after 1 iteration. The SVD is not a "shortcut" for
> ALS. If you go to the trouble of a full SVD, you can certainly use
> that factorization as-is though.
> 
> You need regularization.
> 
> 
> It should be pointed out that the "ALS" often spoken of here is not
> one that factors the input matrix A. There's a variant, that I have
> had good results with, for 'implicit' feedback. There, you are
> actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
> values in A as weights in the loss function. You're reconstructing
> "interacted or not" and using input value as a confidence measure.
> This "works" for ratings although the interpretation in this variant
> doesn't line up with the nature of ratings. It works quite nicely for
> things like clicks, etc.
> 
> (Much more can be said on this point.)
> 
> 
> 
> On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <co...@dhuebner.com> wrote:
>> It's quite hard for me to get the mathematical concepts of the ALS
>> recommenders. It would be great if someone could help me to figure out
>> the details. This is my current status:
>>
>> 1. The item-feature (M) matrix is initialized using the average ratings
>> and random values (explicit case)
>>
>> 2. The user-feature (U) matrix is solved using the partial derivative of
>> the error function with respect to u_i (the columns of row-vectors of U)
>>
>> Supposed we use as many features as items are known and the error
>> function does not use any regularization. Would U be solved within the
>> first iteration? If not, I do not understand why more than one iteration
>> is needed.
>> Furthermore, I believe to have understood that using fewer features than
>> items and also applying regularization, does not allow to solve U in a
>> way that the stopping criterion can be met after only one iteration.
>> Thus, iteration is required to gradually converge to the stopping
>> criterion.
>>
>> I hope I have pointed out my problems clearly enough.
>>


Re: Mathematical background of ALS recommenders

Posted by Sean Owen <sr...@gmail.com>.
Points from across several e-mails --

The initial item-feature matrix can be just random unit vectors too. I
have slightly better results with that.
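Purely for illustration (a numpy sketch of my own, not Mahout code),
random unit row vectors could be produced like this:

    import numpy as np

    def init_item_factors(num_items, k, seed=0):
        rng = np.random.default_rng(seed)
        M = rng.standard_normal((num_items, k))        # random directions
        M /= np.linalg.norm(M, axis=1, keepdims=True)  # scale each row to unit length
        return M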

You are finding the least-squares solution of A = U M' for U, given A
and M. Yes, you can derive that analytically by setting the derivative
of the error function to zero.
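Concretely (my notation, one common way to write it): for a user's
rating vector a_u, setting the derivative of
||a_u - M u_u||^2 + lambda ||u_u||^2 with respect to u_u to zero gives
the normal equations

  u_u = (M^\top M + \lambda I)^{-1} M^\top a_u

where, in the explicit case, M and a_u are restricted to the items that
user actually rated. The item update is symmetric, which is the
"alternating" part of ALS.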

With m users, n items, and k features, where k = n, I suppose you
don't need any iterations at all, since there is a trivial solution:
U = A, M = I(n) (the n x n identity matrix). You would not find this on
the first iteration, however, if you followed the algorithm, because
you would be starting from some random starting point. But if you
initialized M to the identity matrix, yes, you'd find the exact
solution immediately.

Yes, it is an iterative algorithm and you have to define a convergence
criterion. I use the average absolute difference in the entries of
(U M') from one iteration to the next (well, over a sample). It's
possible that you reach your criterion in 1 iteration, or not; it
depends on the criterion. Usually when you restart ALS on updated data,
you use the previous M matrix as a starting point. If the change in
data is small, one iteration should usually leave you still
"converged", actually. But from a random starting point -- not typical.
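As a rough sketch of that kind of stopping check (plain numpy, my own
code, sampling entry positions rather than materializing the full U M'):

    import numpy as np

    def avg_abs_change(U_prev, M_prev, U, M, sample=10000, seed=0):
        # average absolute change in sampled entries of U M' between iterations
        rng = np.random.default_rng(seed)
        users = rng.integers(0, U.shape[0], sample)
        items = rng.integers(0, M.shape[0], sample)
        before = np.einsum('ij,ij->i', U_prev[users], M_prev[items])
        after = np.einsum('ij,ij->i', U[users], M[items])
        return np.abs(after - before).mean()

    # e.g. stop iterating once avg_abs_change(...) drops below a tolerance

Whether one iteration is "enough" then just depends on where you set
that tolerance and where you started from.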

ALS is similar to SVD only in broad terms. The SVD is not always used
to make a low-rank factorization, and, its outputs carry more
guarantees -- they are orthonormal bases because it has factored out
scaling factors into the third matrix Sigma. I think of ALS as more
simplistic and therefore possibly faster. With k features I believe
(?) the SVD is necessarily a k-iteration process at least, whereas ALS
might be of use after 1 iteration. The SVD is not a "shortcut" for
ALS. If you go to the trouble of a full SVD, you can certainly use
that factorization as-is though.

You need regularization.


It should be pointed out that the "ALS" often spoken of here is not
one that factors the input matrix A. There's a variant, which I have
had good results with, for 'implicit' feedback. There, you are actually
factoring the matrix P (where P_ij = 1 if A_ij != 0, and 0 otherwise),
and using the values in A as weights in the loss function. You're
reconstructing "interacted or not" and using the input value as a
confidence measure. This "works" for ratings, although the
interpretation in this variant doesn't line up with the nature of
ratings. It works quite nicely for things like clicks, etc.
(Much more can be said on this point.)
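To make that concrete, here is a toy numpy sketch of a single per-user
solve in that variant (my own code; the confidence form
c = 1 + alpha * value follows the Hu/Koren/Volinsky paper and is an
assumption here, since the exact weighting isn't spelled out above):

    import numpy as np

    def solve_user_implicit(M, a_u, alpha=40.0, lam=0.1):
        # M: item-feature matrix (n_items x k); a_u: the user's raw interaction values
        p_u = (a_u > 0).astype(float)       # row of P: interacted or not
        c_u = 1.0 + alpha * a_u             # confidence weights from the raw values
        k = M.shape[1]
        gram = M.T @ (c_u[:, None] * M) + lam * np.eye(k)
        rhs = M.T @ (c_u * p_u)
        return np.linalg.solve(gram, rhs)   # the user's new feature vector

A real implementation would exploit sparsity (e.g. the C^u - I trick
from that paper) instead of densifying a_u, but the algebra is the same.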



On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <co...@dhuebner.com> wrote:
> It's quite hard for me to get the mathematical concepts of the ALS
> recommenders. It would be great if someone could help me to figure out
> the details. This is my current status:
>
> 1. The item-feature (M) matrix is initialized using the average ratings
> and random values (explicit case)
>
> 2. The user-feature (U) matrix is solved using the partial derivative of
> the error function with respect to u_i (the columns of row-vectors of U)
>
> Supposed we use as many features as items are known and the error
> function does not use any regularization. Would U be solved within the
> first iteration? If not, I do not understand why more than one iteration
> is needed.
> Furthermore, I believe to have understood that using fewer features than
> items and also applying regularization, does not allow to solve U in a
> way that the stopping criterion can be met after only one iteration.
> Thus, iteration is required to gradually converge to the stopping
> criterion.
>
> I hope I have pointed out my problems clearly enough.
>