Posted to user@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2011/01/06 22:45:18 UTC

Re: seq2sparse and lsi fold-in

Dmitriy,

  I'm not sure if you figured this out on your own and I didn't see the
email, but if not:

On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Also, if i have a bunch of new documents to fold-in, it looks like i'd need
> to run a matrix multiplication job between new document vectors and V, both
> matrices represented row-wise. So DistributedRowMatrix should help me,
> shouldn't it? do i need to transpose the first matrix first?
>

If you have a dense matrix V of eigenvectors (i.e., it has K rows of dense
vectors, where K is a small number like a few hundred, each of cardinality M,
which may be large), represented as a DistributedRowMatrix, and you have your
original document matrix C, which has N rows, each of cardinality M, then
you actually need to take the transpose of *both* matrices, then call
DistributedRowMatrix.times() on them:

  V_transpose = V.transpose();
  C_transpose = C.transpose();
  C_times_V_transpose = C_transpose.times(V_transpose);

This code will yield the mathematical result of C * V^T, which is probably
what you want.

(It turns out that this set of operations could also be done in a single
custom operation using the row-paths of both V and C as inputs, but you'd
still need two MapReduce shuffles to get the answer, so doing it that way
isn't really a savings.)
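The shape bookkeeping above can be sketched in plain NumPy (this is not
Mahout code, just an illustration of the algebra). The double transpose makes
sense if, as the recipe implies, DistributedRowMatrix.times() effectively
computes A^T * B from the rows of A and B; the `drm_times` helper below is a
hypothetical stand-in for that behavior:

```python
import numpy as np

# Shapes from the thread: C is the document matrix (N docs x M terms),
# V holds the K eigenvectors as rows (K x M), with K much smaller than M.
N, M, K = 6, 10, 3
rng = np.random.default_rng(0)
C = rng.random((N, M))
V = rng.random((K, M))

# The desired fold-in result is C * V^T (N x K): each document projected
# onto the K-dimensional LSI space.
folded = C @ V.T

def drm_times(A, B):
    # Hypothetical stand-in for DistributedRowMatrix.times(), assuming it
    # returns A^T * B (a sum of outer products over paired rows).
    return A.T @ B

# Transposing both inputs then recovers C * V^T, as in Jake's snippet:
assert np.allclose(drm_times(C.T, V.T), folded)
```

Under that assumption, `C_transpose.times(V_transpose)` yields
(C^T)^T * V^T = C * V^T, which is exactly the fold-in product.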

  -jake

Re: seq2sparse and lsi fold-in

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Jake.

Yes, I had figured that out, and it seems that DRM.times() does just that. I
was just not sure about the production quality of this code: DRM seems to
have seen a lot of fixes and discussion lately, including around simple
multiplication.

On a side note, one needs to compute C x V^T x Sigma^-1. But I have an
option in the stochastic SVD command line to compute V x Sigma^0.5 instead of
V, and U x Sigma^0.5 instead of U, in which case the correction for singular
values indeed turns into the simple multiplication C x V^T, and the singular
value matrix can be ignored (especially if one may want to measure
similarities between a user and an item, not just user-user or item-item).
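A small NumPy sketch (again, not the Mahout SSVD code) of the half-power
scaling described here: splitting Sigma symmetrically as
A = (U Sigma^0.5)(V Sigma^0.5)^T puts users and items in the same scaled
space, so a plain dot product between a user row and an item row
reconstructs the corresponding entry of A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 4))  # e.g. a users-x-items matrix

# Full SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Split the singular values evenly between the two sides:
#   A = (U * s^0.5) * (V * s^0.5)^T
U_half = U * np.sqrt(s)      # rows: users in the scaled space
V_half = Vt.T * np.sqrt(s)   # rows: items in the scaled space

# With the full (untruncated) SVD this reconstruction is exact, so
# user-item dot products live in the same space as user-user and
# item-item ones, and no separate Sigma correction is needed.
assert np.allclose(U_half @ V_half.T, A)
```

With a rank-K truncation the product only approximates A, but the symmetry
argument is the same.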

-d
