Posted to user@mahout.apache.org by Laszlo Dosa <la...@fredhopper.com> on 2010/08/04 01:35:17 UTC

RE: distributed svd

Jake,

I tried to implement A . V . S^-1 for calculating the U matrix.
Unfortunately, I had a couple of difficulties.

My original matrix A is m x n (rows x columns).
After calculating Lanczos( A ) = V_k, V_k has k rows and n columns.

To use DistributedRowMatrix.times() I need to transpose both A and V_k, and
thus calculating U_k seems to be costly:

	U_k = V_k.transpose().times(A.transpose()) . S^-1
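
In code, what I am doing is roughly this (a sketch against the 0.x
DistributedRowMatrix API; aPath, vPath, tmpPath, m, n, k are placeholders,
and I am assuming times() pairs up rows, i.e. computes this' . other):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.DistributedRowMatrix;

    // A is m x n on HDFS; the Lanczos output Vk is k x n (rows are eigenvectors)
    Configuration conf = new Configuration();
    DistributedRowMatrix A  = new DistributedRowMatrix(aPath, tmpPath, m, n);
    DistributedRowMatrix Vk = new DistributedRowMatrix(vPath, tmpPath, k, n);
    A.setConf(conf);
    Vk.setConf(conf);

    DistributedRowMatrix At = A.transpose();   // MR job 1: A' is n x m
    DistributedRowMatrix V  = Vk.transpose();  // MR job 2: V is n x k
    DistributedRowMatrix AV = At.times(V);     // MR job 3: (A')' . V = A . V
    // ...then scale column i of AV by 1/sigma_i to get U_k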

I would like to avoid transposing both matrices. Is it possible?
Is there any other way to implement A . V . S^-1 efficiently?
Any ideas are welcome.

Regards,
Laszlo

-----Original Message-----
From: Jake Mannix [mailto:jake.mannix@gmail.com] 
Sent: Saturday, July 24, 2010 1:10 AM
To: user@mahout.apache.org
Subject: Re: distributed svd

Hi Laszlo,

On Fri, Jul 23, 2010 at 1:14 AM, Laszlo Dosa <la...@fredhopper.com>
wrote:

I tried to calculate SVD-based recommendations by:

1)      Calculate A'A

2)      Calculate V_k by using Lanczos to get top k eigenvectors of A'A

The SVD implementation in Mahout does not require you to do these two steps
separately - they happen at the same time if you pass the A matrix to
the DistributedLanczosSolver directly.
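
For example, something like this (just a sketch - the exact solve()
signature and constructor arguments shift between 0.x releases, so treat
the argument lists below, and the aPath/tmpPath/m/n/k placeholders, as
assumptions to check against your version):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.hadoop.DistributedRowMatrix;
    import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

    // run Lanczos on A itself: the solver multiplies by A'A internally
    // (via timesSquared()), so you never materialize A'A
    DistributedRowMatrix A = new DistributedRowMatrix(aPath, tmpPath, m, n);
    A.setConf(conf);

    // in practice you may want to ask for a slightly higher rank than k,
    // since some Lanczos vectors can be discarded
    Matrix eigenVectors = new DenseMatrix(k, n);
    List<Double> eigenValues = new ArrayList<Double>();
    new DistributedLanczosSolver().solve(A, k, eigenVectors, eigenValues, false);
    // rows of eigenVectors are the right singular vectors, i.e. V_k transposed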

3)      Calculate preferences for each user by: Vector usersPreference =
V_k.timesSquared(row of A);

What you want to do here is take the A matrix, represented by a
DistributedRowMatrix, and the V_k matrix (what you actually get from the
SVD job is essentially V_k transposed), represented by that same class, and
then just compute A . V . S^-1, where S is the diagonal matrix of singular
values (note that these are the square roots of the eigenvalues you get out
of Lanczos).
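
Concretely: if Lanczos on A'A reports eigenvalue lambda_i = 16.0 for
eigenvector v_i, the matching singular value is sigma_i = sqrt(16.0) = 4.0,
and since S is diagonal, column i of U is just

	u_i = (A . v_i) / sigma_i

so the S^-1 factor never needs a matrix of its own.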

A . V can be computed in one mapreduce job (DistributedRowMatrix.times()),
so that's the way to do it efficiently.
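
Applying S^-1 afterwards is then a cheap element-wise pass over each row of
A . V. A sketch of that per-row step (assuming the 0.x Vector interface
with clone()/get()/set()):

    import org.apache.mahout.math.Vector;

    // turn one row of A.V (length k) into the matching row of U
    static Vector scaleBySigmaInverse(Vector avRow, double[] sigma) {
      Vector uRow = avRow.clone();
      for (int i = 0; i < sigma.length; i++) {
        uRow.set(i, avRow.get(i) / sigma[i]);  // divide entry i by sigma_i
      }
      return uRow;
    }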

  -jake


4)      Calculate 20 recommendations for each user by selecting the top 20
preferences from usersPreference
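
For the selection I use a small bounded heap per user, roughly like this (a
sketch; it assumes Vector.iterateNonZero() and copies index/score out
because the iterator may reuse its Element object):

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.PriorityQueue;
    import org.apache.mahout.math.Vector;

    // keep the 20 highest-scoring (itemIndex, score) pairs
    PriorityQueue<double[]> top = new PriorityQueue<double[]>(21,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[1], b[1]);  // min-heap on score
          }
        });
    Iterator<Vector.Element> it = usersPreference.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      top.offer(new double[] { e.index(), e.get() });
      if (top.size() > 20) {
        top.poll();  // evict the smallest of the current 21
      }
    }
    // top now holds this user's 20 best item indices and scores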

The number of users and the number of items are each between 100 000 and
300 000.

In my experience, step 3) scales linearly with the number of users.

Each job runs fast, but I have ~100 000 jobs. This slows down the whole
calculation.

Is the approach I am using wrong?

Or can you suggest something to make this faster?

Regards,

Laszlo