Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2012/10/19 18:06:06 UTC

How to use ssvd for dimensionality reduction of tfidf-vectors?

Sorry for the basic question. I've been reading about this for a few hours,
but I'm still confused. I want to use ssvd to reduce the dimensionality of
some tfidf-vectors so I can perform clustering on the result.

Among many other things, I've read:
https://cwiki.apache.org/MAHOUT/dimensional-reduction.html

Which states the process for svd is:

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix

I know you don't need to do cleansvd with ssvd output. My main question is
which of the three outputs of ssvd should I be transposing and multiplying
with the original tfidf-matrix? I'm having trouble understanding the math
that's going on.

ssvd outputs U, V, and sigma, and despite reading a bunch, I'm still
confused on which of these outputs I should be using, and how. Could anyone
spell it out for me?

Thanks for any help,
Matt

Re: How to use ssvd for dimensionality reduction of tfidf-vectors?

Posted by Pat Ferrel <pa...@gmail.com>.
Let me go out on a limb and explain my understanding in layman's terms; hopefully someone will correct me where I have erred...

What Dmitriy describes below creates a matrix "output". This is your original matrix transformed into the new reduced-dimensionality space. It will have a row for each of the original "input" docs/items and a column for each of the reduced basis vectors (technically the right singular vectors, I believe). So it will have 80 columns in the example below. The reduction happens in the ssvd job by throwing away the least significant new basis vectors, keeping 80. The job keeps the vectors that retain the most variance, which is a good measure of how well the reduced vectors characterize the original input data.

Dmitriy is also applying principal component analysis, which creates a factorization where the retained basis vectors are as uncorrelated or independent as possible; in other words, they are orthogonal (http://en.wikipedia.org/wiki/Principal_component_analysis). In the ssvd job, the pca option amounts to mean-centering the input before the decomposition.

If you are doing a one-way transform, you don't need to do anything with V (or U or Sigma). But conceptually at least, the output matrix conforms to them. In other words, input ≈ U*S*V^t, where U, S, and V are approximations of the true mathematical values and are calculated by the Mahout stochastic singular value decomposition. Further, input ≈ output * V^t, but don't try to actually perform this operation (read below).
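
To make the shapes concrete with made-up numbers (not from this thread): if the tfidf input were 1,000,000 docs by 200,000 terms and k = 80, then U would be 1,000,000 x 80, S would be 80 x 80, V would be 200,000 x 80, and output = U*S would be 1,000,000 x 80. So input (1,000,000 x 200,000) ≈ output (1,000,000 x 80) * V^t (80 x 200,000), which is also why you don't want to materialize that product: it would be a dense matrix the size of the original.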

You can then operate on the output matrix as you would on the original, but remember that its columns no longer correspond to the original input columns; each row is now expressed as weights on the new basis vectors.
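
If you want to sanity-check the reduced matrix before clustering, Mahout's seqdumper utility can print it. A minimal sketch, assuming the U*Sigma rows were written under something like output/USigma (check your ssvd run's output listing for the exact directory name):

bin/mahout seqdumper -i output/USigma -c
bin/mahout seqdumper -i output/USigma -o /tmp/usigma.txt

The first should report a row count matching your number of docs; in the second dump, each value should be a dense vector of length k (80 in this example).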

If you want to transform a particular row vector (or several) from the output matrix back into tfidf term space, you can create a matrix of the vectors you want to transform and multiply them by V^t (V^t is the pseudo-inverse of V, since V has orthonormal columns, though it is not square because it was truncated to k columns). Obviously you would want to tell the ssvd job to calculate V if you need it for this. But remember that the reduced row vectors in "output" are dense and V^t is dense, so the multiply will take a relatively long time; don't do it for all of your output.
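
A rough sketch of that back-transform, in the same schematic style as the wiki steps quoted above. The paths are assumptions, and note that the wiki steps imply matrixmult multiplies the transpose of its first argument by its second, so check the dimensions against your own data:

bin/mahout ssvd -i tfidf -o ssvdOut -k 80 -pca true -us true -V true
bin/mahout transpose centroids -> centroidsT        (centroids: clusters x 80)
bin/mahout transpose ssvdOut/V -> Vt                (V: terms x 80)
bin/mahout matrixmult centroidsT Vt -> centroidsInTermSpace        (clusters x terms, i.e. centroids * V^t)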

In my case, I take the output and run it through clustering, which creates clusters with centroids. I then plan to transform the centroids back into the original tfidf space as described above, so I can see their weights in relation to the original tfidf terms. I haven't done this last bit yet.

Perhaps I'll learn something from the corrections to come.


On Oct 19, 2012, at 9:48 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

The ssvd process for dimensionality reduction is easier. Assuming your data
points are row vectors of the input (which is the case with the output of
Mahout's seq2sparse), you need the U*Sigma output of the pca flow.

I.e., you need something like
mahout ssvd -i input -o output -k 80 -pca true -us true -U false -V false...

This information is also in the latest ssvd manual on wiki.

Take the latest trunk. Some of the pca flow components got broken recently, and I
fixed them just last week.

Re: How to use ssvd for dimensionality reduction of tfidf-vectors?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
The ssvd process for dimensionality reduction is easier. Assuming your data
points are row vectors of the input (which is the case with the output of
Mahout's seq2sparse), you need the U*Sigma output of the pca flow.

I.e., you need something like
mahout ssvd -i input -o output -k 80 -pca true -us true -U false -V false...

This information is also in the latest ssvd manual on wiki.
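
For context, here is a minimal end-to-end sketch of that flow, in the same schematic style as the wiki steps at the top of the thread. The directory names (including USigma for the U*Sigma output), the cluster count, and the kmeans flags are assumptions for illustration, not prescriptions:

bin/mahout seq2sparse -i text-seqfiles -o vectors        (produces vectors/tfidf-vectors)
bin/mahout ssvd -i vectors/tfidf-vectors -o ssvdOut -k 80 -pca true -us true -U false -V false
bin/mahout kmeans -i ssvdOut/USigma -c seed-clusters -o clusters -k 20 -x 10 -cl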

Take the latest trunk. Some of the pca flow components got broken recently, and I
fixed them just last week.

Re: How to use ssvd for dimensionality reduction of tfidf-vectors?

Posted by Chris Hokamp <ch...@gmail.com>.
Hi Matt,

I ran into the same issue a few months ago. Here's the thread from the
mailing list archives [1]. Also, check out this PDF [2]; it's more
explicit about the functionality of the various command-line params for
ssvd.

Cheers,
Chris

[1]
http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3CCABCMrkNW7WwpNWDAmbGNKSLb17ijB+SbcEp9SMdg2yEvU9cq9A@mail.gmail.com%3E
[2]
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
