Posted to user@mahout.apache.org by Stefan Wienert <st...@wienert.cc> on 2011/06/05 22:16:41 UTC

Need a little help with SVD / Dimensional Reduction

Hi,

after reading this:
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

This looks similar to LSA/LSI, but I have some questions:

In this example, the "tfidf-vectors"-matrix has 6,076,937 rows and
20,444 columns.
My first question is: do the rows represent the documents and the
columns the terms, as in a traditional term-document matrix?

So after the SVD job, you get these 87 eigenvectors, each with 20,444
entries (representing the terms).
These seem to be eigenvectors of tfidf-vectors, but reduced to only
87 documents? What is this mathematically?

And why do you calculate tfidf-vectors^T * svdOut^T? I cannot find
an explanation for this myself.
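
To keep the shapes straight, here is a tiny numpy sketch of what I think
the wiki pipeline computes (purely illustrative, not Mahout code; the
sizes are toy stand-ins for 6,076,937 x 20,444 and rank 87):

```python
import numpy as np

docs, terms, rank = 6, 4, 2              # toy stand-ins for 6076937, 20444, 87
rng = np.random.default_rng(0)
M = rng.random((docs, terms))            # "tfidf-vectors": rows = docs, cols = terms

# Eigenvectors of M' * M live in term space, so each one has `terms` entries,
# matching the "svd" output shape of (concepts x terms).
evals, evecs = np.linalg.eigh(M.T @ M)
svd_out = evecs[:, -rank:].T             # (rank x terms)

# Multiplying the corpus by these term-space vectors projects every
# document into concept space: (docs x terms) @ (terms x rank).
doc_vectors = M @ svd_out.T
print(doc_vectors.shape)                 # (6, 2), i.e. docs x concepts
```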

Compared to a full SVD, is the result the "right singular vectors"?

I know it works, but I don't understand some of these steps. Please help... :)

-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838 (new number since 20.06.10)
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Hi.

Thanks for the help.

The important points from wikipedia are:
- The left singular vectors of M are eigenvectors of M*M' .
- The right singular vectors of M are eigenvectors of M'*M.
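
These two relations are easy to sanity-check numerically; a quick sketch
with numpy (illustrative only, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((5, 3))                              # any non-square matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)    # M = U * diag(s) * Vt

# Left singular vectors of M are eigenvectors of M * M'.
_, left = np.linalg.eigh(M @ M.T)                   # eigh sorts eigenvalues ascending
assert np.isclose(abs(U[:, 0] @ left[:, -1]), 1.0)  # same direction, up to sign

# Right singular vectors of M are eigenvectors of M' * M.
_, right = np.linalg.eigh(M.T @ M)
assert np.isclose(abs(Vt[0] @ right[:, -1]), 1.0)   # same direction, up to sign
```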

As you describe, the Mahout Lanczos solver calculates A = M'*M (I think
it actually does A = M*M', but that is not a problem). So it already
calculates the right (or left) singular vectors of M.

But my question is: how can I get the other singular vectors? I could
transpose M, but then I would have to calculate two SVDs, one for the right
and one for the left singular vectors... I think there is a better way
:)

Hope you can help me with this...
Thanks
Stefan


2011/6/6 Danny Bickson <da...@gmail.com>:
> Hi
> The Mahout SVD implementation computes the Lanczos iteration:
> http://en.wikipedia.org/wiki/Lanczos_algorithm
> Denote the non-square input matrix as M. First a symmetric matrix A is
> computed by A=M'*M
> Then an approximating tridiagonal matrix T and a vector matrix V are
> computed such that A =~ V*T*V'
> (this process is done in a distributed way).
>
> Next, the matrix T is decomposed into eigenvectors and eigenvalues,
> which are the returned result. (This process
> is serial.)
>
> The third step makes the returned eigenvectors orthogonal to each other
> (which is optional IMHO).
>
> The heart of the code is found at:
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> At least that is where it was in version 0.4; I am not sure if there are
> any changes in version 0.5.
>
> Anyway, Mahout does not compute the SVD directly. If you are interested in
> learning more about the relation to the SVD,
> look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
> subsection: relation to eigenvalue decomposition.
>
> Hope this helps,
>
> Danny Bickson
>
> On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc> wrote:
>
>> After reading this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>
>> Wiki-SVD: M = U S V* (* = transposed)
>>
>> The output of Mahout-SVD is (U S), right?
>>
>> So... how do I get V from (U S) and M?
>>
>> Is V = M (U S)* (because that is what the calculation in the example does)?
>>
>> Thanks
>> Stefan
>>
>> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >
>> > What is done:
>> >
>> > Input:
>> > tf-idf-matrix (docs x terms) 6076937 x 20444
>> >
>> > "SVD" of the tf-idf-matrix (rank 100) produces the eigenvectors (and
>> > eigenvalues) of the tf-idf-matrix, called:
>> > svd (concepts x terms) 87 x 20444
>> >
>> > transpose tf-idf-matrix:
>> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>> >
>> > transpose svd:
>> > svd-transpose (terms x concepts) 20444 x 87
>> >
>> > matrix multiply:
>> > tf-idf-matrix-transpose x svd-transpose = result
>> > (terms x docs) x (terms x concepts) = (docs x concepts)
>> >
>> > so... I understand that the "svd" here is not the SVD from Wikipedia.
>> > It only does the Lanczos algorithm and some magic which produces the
>> >> Instead either the left or right (but usually the right) eigenvectors
>> >> premultiplied by the diagonal or the square root of the
>> >> diagonal element.
>> > from
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>> >
>> > so my question: what is the output of the SVD in Mahout? And what do I
>> > have to calculate to get the "right singular vectors" from it?
>> >
>> > Thanks,
>> > Stefan
>> >
>> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >>
>> >> the last step is the matrix multiplication:
>> >>  --arg --numRowsA --arg 20444 \
>> >>  --arg --numColsA --arg 6076937 \
>> >>  --arg --numRowsB --arg 20444 \
>> >>  --arg --numColsB --arg 87 \
>> >> so the result is a 6,076,937 x 87 matrix
>> >>
>> >> the input has 6,076,937 documents (each with 20,444 terms), so judging
>> >> by the dimensions, the result of the matrix multiplication has to be
>> >> the right singular vectors.
>> >>
>> >> so the result is the "concept-document vector matrix" (which is, I
>> >> think, also called the "document vectors"?)
>> >>
>> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>> >>> Yes.  These are term vectors, not document vectors.
>> >>>
>> >>> There is an additional step that can be run to produce document
>> vectors.
>> >>>
>> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc>
>> wrote:
>> >>>
>> >>>> compared to SVD, is the result is the "right singular value"?
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Stefan Wienert
>> >>
>> >> http://www.wienert.cc
>> >> stefan@wienert.cc
>> >>
>> >> Telefon: +495251-2026838
>> >> Mobil: +49176-40170270
>> >>
>> >
>> >
>> >
>> > --
>> > Stefan Wienert
>> >
>> > http://www.wienert.cc
>> > stefan@wienert.cc
>> >
>> > Telefon: +495251-2026838
>> > Mobil: +49176-40170270
>> >
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I guess the only problem that creates such demand for CDH is the fact
that the Hadoop project twisted everybody's arm by deprecating the entire
MR API over what seem to be perceived OO design issues rather than
functional ones. Even that would have been okay if they had not gone well
over a year without providing a full replacement for the functionality in
the new API, and 0.21, which should contain most of the replacements, is
recognized not to be production grade.



On Wed, Jun 8, 2011 at 12:27 AM, Sean Owen <sr...@gmail.com> wrote:
> Hadoop is Hadoop, so I don't know that any roadmap inconsistency between CDH
> and Hadoop is somehow Hadoop's fault.
>
> I don't think it's this ambiguous. Mahout runs on 0.20.2; Amazon EMR runs
> 0.20.2. The latest Hadoop version is 0.20.203.0. CDH is indeed somewhere
> in between, but that's CDH.
>
> On Wed, Jun 8, 2011 at 2:46 AM, Lance Norskog <go...@gmail.com> wrote:
>
>> CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed
>> up their "forward motion" in api design and so now the Hadoop versions
>> are screwy. The Cloudera distribution is Hadoop 0.20.something plus
>> various bug fixes and features. Mahout runs on normal 0.20.something
>> and Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.
>>
>> Lance
>>
>>
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Sean Owen <sr...@gmail.com>.
Hadoop is Hadoop, so I don't know that any roadmap inconsistency between CDH
and Hadoop is somehow Hadoop's fault.

I don't think it's this ambiguous. Mahout runs on 0.20.2; Amazon EMR runs
0.20.2. The latest Hadoop version is 0.20.203.0. CDH is indeed somewhere
in between, but that's CDH.

On Wed, Jun 8, 2011 at 2:46 AM, Lance Norskog <go...@gmail.com> wrote:

> CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed
> up their "forward motion" in api design and so now the Hadoop versions
> are screwy. The Cloudera distribution is Hadoop 0.20.something plus
> various bug fixes and features. Mahout runs on normal 0.20.something
> and Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.
>
> Lance
>
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Lance Norskog <go...@gmail.com>.
CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed
up their "forward motion" in api design and so now the Hadoop versions
are screwy. The Cloudera distribution is Hadoop 0.20.something plus
various bug fixes and features. Mahout runs on normal 0.20.something
and Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.

Lance

On Tue, Jun 7, 2011 at 1:23 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Hmm... I think it is no problem to implement my tool so it works with
> both versions.
>
> My intention is to get document similarity using clustering or cosine
> similarity (and before, dimension reduction via svd).
>
> So I think there is no problem with a less precise svd if it is fast :)
>
> At least you have exactly described the parallelization process which
> would help me a lot writing this part of my master thesis ;).
>
> I will try to implement this tomorrow...
>
> Thanks
> Stefan
>
> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>> You can check what I summed up in the javadoc for SSVDSolver to get some idea.
>>
>> But on top of what I have already mentioned, the main difference is that
>> it presumably doesn't require as many resources (the Reuters dataset runs
>> through with the default -Xmx200m per-task setting just fine), and the
>> theoretical flop count is quite a bit lower than that of Lanczos, which
>> probably means it should rip through the same volume faster. (My practical
>> run showed it takes about the same time as the seq2sparse that produced
>> its input, and that was before the sparse vector bug fix, which had caused
>> a significant slowdown on sparse data.)
>>
>> I don't think you should 'modify' your program, especially if you need
>> to show practical results within a week. But you might help me by
>> creating a version of your program that takes it, and we can see
>> together what it takes to get it working for you.
>>
>> -d
>>
>> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>> Before I rewrite my program, is there any advantage over the lanczos svd?
>>>
>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>> I am saying I did not test it with 0.20.2.
>>>>
>>>> Yes, it is integrated in the 0.5 release, but there might be problems with
>>>> hadoop 0.20.2.
>>>>
>>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>> Hmm... looks nice...
>>>>>
>>>>> So there is a Lanczos implementation of SVD and the stochastic version
>>>>> of SVD. Both produce the doc-concept vectors that I need.
>>>>> I actually get my TFIDF vectors directly from a Lucene index (and have
>>>>> to do some magic to get IntWritable, VectorWritable).
>>>>>
>>>>> Still, I do not exactly understand what you are trying to say. Does your
>>>>> SSVD not run with Mahout, but only on CDH (what is that, btw?)? Or is it
>>>>> available for Mahout? So what has to be modified to run it with
>>>>> Mahout?
>>>>>
>>>>> Thanks
>>>>> Stefan
>>>>>
>>>>>
>>>>>
>>>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of
>>>>>> producing exact U, V and Sigma,
>>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>>>>
>>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>>>>> necessarily DoubleWritable (you can still keep your original document
>>>>>> paths or whatever they are identified with), that info is migrated as
>>>>>> keys of the output of U matrix. So you can run it directly on the
>>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters
>>>>>> dataset example from Mahout In Action).
>>>>>>
>>>>>> The computation has some stochastic noise to it, but LSI problems are
>>>>>> never exact problems anyway, and their results are subject to corpus
>>>>>> selection.
>>>>>>
>>>>>> If you are interested in helping me work the kinks out in Mahout's
>>>>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on
>>>>>> CDH in a customized way. But be warned that it may require a number of
>>>>>> patches before it works for you.
>>>>>>
>>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> -D
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>>>>
>>>>>>> And now your questions:
>>>>>>>
>>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF
>>>>>>> document-term matrix and reduce it using Lanczos. So I get the
>>>>>>> document-concept vectors. These vectors will be compared with cosine
>>>>>>> similarity to find similar documents. I can also cluster these
>>>>>>> documents with k-means...
>>>>>>>
>>>>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>>>>> in other document-similarity-techniques.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> If I understand your question correctly, you need to simply transpose M
>>>>>>>> before you start the run and that way
>>>>>>>> you will get the other singular vectors.
>>>>>>>>
>>>>>>>> May I ask what is the problem you are working on and why do you need the
>>>>>>>> singular vectors?
>>>>>>>> Can you consider using another matrix decomposition technique, for example
>>>>>>>> alternating least squares,
>>>>>>>> which gives you two lower-rank matrices that approximate the large decomposed
>>>>>>>> matrix?
>>>>>>>>
>>>>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>>>
>>>>>>>>> Hi Danny!
>>>>>>>>>
>>>>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>>>>
>>>>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>>>>> can help me!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>>> > Hi Stefan!
>>>>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>>>>> > identical.
>>>>>>>>> > See the SVD Wikipedia text: When *M* is also positive
>>>>>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>>>>>> decomposition.
>>>>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>>>>> >
>>>>>>>>> > Hope this helps!
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Stefan Wienert
>>>>>>>>>
>>>>>>>>> http://www.wienert.cc
>>>>>>>>> stefan@wienert.cc
>>>>>>>>>
>>>>>>>>> Telefon: +495251-2026838
>>>>>>>>> Mobil: +49176-40170270
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Wienert
>>>>>>>
>>>>>>> http://www.wienert.cc
>>>>>>> stefan@wienert.cc
>>>>>>>
>>>>>>> Telefon: +495251-2026838
>>>>>>> Mobil: +49176-40170270
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefan Wienert
>>>>>
>>>>> http://www.wienert.cc
>>>>> stefan@wienert.cc
>>>>>
>>>>> Telefon: +495251-2026838
>>>>> Mobil: +49176-40170270
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>



-- 
Lance Norskog
goksron@gmail.com

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Hmm... I think it is no problem to implement my tool so that it works with
both versions.

My intention is to get document similarity using clustering or cosine
similarity (and before that, dimensionality reduction via SVD).

So I think there is no problem with a less precise SVD if it is fast :)
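
For the similarity step, I mean something like this (a minimal numpy
sketch with made-up doc-concept vectors, just to show the computation):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up document vectors in the reduced concept space (docs x concepts).
doc_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0]])

sim_01 = cosine(doc_vectors[0], doc_vectors[1])   # close to 1: similar documents
sim_02 = cosine(doc_vectors[0], doc_vectors[2])   # 0.0: no shared concepts
```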

In any case, you have described the parallelization process exactly, which
will help me a lot in writing this part of my master's thesis ;).

I will try to implement this tomorrow...

Thanks
Stefan

2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
> You can check what I summed up in the javadoc for SSVDSolver to get some idea.
>
> But on top of what I have already mentioned, the main difference is that
> it presumably doesn't require as many resources (the Reuters dataset runs
> through with the default -Xmx200m per-task setting just fine), and the
> theoretical flop count is quite a bit lower than that of Lanczos, which
> probably means it should rip through the same volume faster. (My practical
> run showed it takes about the same time as the seq2sparse that produced
> its input, and that was before the sparse vector bug fix, which had caused
> a significant slowdown on sparse data.)
>
> I don't think you should 'modify' your program, especially if you need
> to show practical results within a week. But you might help me by
> creating a version of your program that takes it, and we can see
> together what it takes to get it working for you.
>
> -d
>
> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <st...@wienert.cc> wrote:
>> Before I rewrite my program, is there any advantage over the lanczos svd?
>>
>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>> I am saying I did not test it with 0.20.2.
>>>
>>> Yes, it is integrated in the 0.5 release, but there might be problems with
>>> hadoop 0.20.2.
>>>
>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>> Hmm... looks nice...
>>>>
>>>> So there is a Lanczos implementation of SVD and the stochastic version
>>>> of SVD. Both produce the doc-concept vectors that I need.
>>>> I actually get my TFIDF vectors directly from a Lucene index (and have
>>>> to do some magic to get IntWritable, VectorWritable).
>>>>
>>>> Still, I do not exactly understand what you are trying to say. Does your
>>>> SSVD not run with Mahout, but only on CDH (what is that, btw?)? Or is it
>>>> available for Mahout? So what has to be modified to run it with
>>>> Mahout?
>>>>
>>>> Thanks
>>>> Stefan
>>>>
>>>>
>>>>
>>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of
>>>>> producing exact U, V and Sigma,
>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>>>
>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>>>> necessarily DoubleWritable (you can still keep your original document
>>>>> paths or whatever they are identified with), that info is migrated as
>>>>> keys of the output of U matrix. So you can run it directly on the
>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters
>>>>> dataset example from Mahout In Action).
>>>>>
>>>>> The computation has some stochastic noise to it, but LSI problems are
>>>>> never exact problems anyway, and their results are subject to corpus
>>>>> selection.
>>>>>
>>>>> If you are interested in helping me work the kinks out in Mahout's
>>>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on
>>>>> CDH in a customized way. But be warned that it may require a number of
>>>>> patches before it works for you.
>>>>>
>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -D
>>>>>
>>>>>
>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>>>
>>>>>> And now your questions:
>>>>>>
>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF
>>>>>> document-term matrix and reduce it using Lanczos. So I get the
>>>>>> document-concept vectors. These vectors will be compared with cosine
>>>>>> similarity to find similar documents. I can also cluster these
>>>>>> documents with k-means...
>>>>>>
>>>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>>>> in other document-similarity-techniques.
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>> If I understand your question correctly, you need to simply transpose M
>>>>>>> before you start the run and that way
>>>>>>> you will get the other singular vectors.
>>>>>>>
>>>>>>> May I ask what is the problem you are working on and why do you need the
>>>>>>> singular vectors?
>>>>>>> Can you consider using another matrix decomposition technique, for example
>>>>>>> alternating least squares,
>>>>>>> which gives you two lower-rank matrices that approximate the large decomposed
>>>>>>> matrix?
>>>>>>>
>>>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>>
>>>>>>>> Hi Danny!
>>>>>>>>
>>>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>>>
>>>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>>>> can help me!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> > Hi Stefan!
>>>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>>>> > identical.
>>>>>>>> > See the SVD Wikipedia text: When *M* is also positive
>>>>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>>>>> decomposition.
>>>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>>>> >
>>>>>>>> > Hope this helps!
>>>>>>>> >
>>>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> >> Hi.
>>>>>>>> >>
>>>>>>>> >> Thanks for the help.
>>>>>>>> >>
>>>>>>>> >> The important points from wikipedia are:
>>>>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>>>> >>
>>>>>>>> >> as you describe, the mahout lanczos solver calculate A=M'*M (I think
>>>>>>>> >> it does A=M*M', but it is not a problem). Therefore it does already
>>>>>>>> >> calculate the right (or left) singular vector of M.
>>>>>>>> >>
>>>>>>>> >> But my question is, how can I get the other singular vector? I can
>>>>>>>> >> transpose M, but then I have to calculated two SVDs, one for the right
>>>>>>>> >> and one for the left singular value... I think there is a better way
>>>>>>>> >> :)
>>>>>>>> >>
>>>>>>>> >> Hope you can help me with this...
>>>>>>>> >> Thanks
>>>>>>>> >> Stefan
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> >> > Hi
>>>>>>>> >> > Mahout SVD implementation computes the Lanzcos iteration:
>>>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>>>>>> >> > computed by A=M'*M
>>>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>>>> >> > computed such that A =~ V*T*V'
>>>>>>>> >> > (this process is done in a distributed way).
>>>>>>>> >> >
>>>>>>>> >> > Next the matrix T is decomposed into eigenvectors and
>>>>>>>> eigenvalues,
>>>>>>>> >> > which is the returned result. (This process
>>>>>>>> >> > is serial).
>>>>>>>> >> >
>>>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>>>>>> other
>>>>>>>> >> > (which is optional IMHO).
>>>>>>>> >> >
>>>>>>>> >> > The heart of the code is found at:
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>>>>>>> are
>>>>>>>> >> > changes in version 0.5
>>>>>>>> >> >
>>>>>>>> >> > Anyway, Mahout does not directly compute the SVD. If you are interested in
>>>>>>>> >> > learning more about the relation to SVD
>>>>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>>>>> >> >
>>>>>>>> >> > Hope this helps,
>>>>>>>> >> >
>>>>>>>> >> > Danny Bickson
>>>>>>>> >> >
>>>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>>>>>>> >> wrote:
>>>>>>>> >> >
>>>>>>>> >> >> After reading this thread:
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>>>>> >> >>
>>>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>>>> >> >>
>>>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>>>> >> >>
>>>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>>>> >> >>
>>>>>>>> >> >> Is V = M (U S)* (because this is what the calculation in the example
>>>>>>>> >> is)?
>>>>>>>> >> >>
>>>>>>>> >> >> Thanks
>>>>>>>> >> >> Stefan
>>>>>>>> >> >>
>>>>>>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>> >> >> >
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >
>>>>>>>> >> >> > What is done:
>>>>>>>> >> >> >
>>>>>>>> >> >> > Input:
>>>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose svd:
>>>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>>>> >> >> >
>>>>>>>> >> >> > matrix multiply:
>>>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>>>>>> >> >> >
>>>>>>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>>>>>>> wikipedia.
>>>>>>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>>>>>>> the
>>>>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>>>>> eigenvectors
>>>>>>>> >> >> premultiplied by the diagonal or the square root of the
>>>>>>>> >> >> >> diagonal element.
>>>>>>>> >> >> > from
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>>>>> >> >> >
>>>>>>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>>>>>>> do I
>>>>>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>>>>>> >> >> >
>>>>>>>> >> >> > Thanks,
>>>>>>>> >> >> > Stefan
>>>>>>>> >> >> >
>>>>>>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>> >> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms), so the result of
>>>>>>>> >> >> >> the matrix multiplication has to be the right singular vectors, judging
>>>>>>>> by
>>>>>>>> >> >> >> the dimensions.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> so the result is the "concept-document vector matrix" (which I
>>>>>>>> >> >> >> think is also called the "document vectors"?)
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>>>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> There is an additional step that can be run to produce document
>>>>>>>> >> >> vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>>>>>>> >
>>>>>>>> >> >> wrote:
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>>> compared to SVD, is the result the "right singular value"?
>>>>>>>> >> >> >>>>
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> --
>>>>>>>> >> >> >> Stefan Wienert
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> http://www.wienert.cc
>>>>>>>> >> >> >> stefan@wienert.cc
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>>>> >> >> >>
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> > --
>>>>>>>> >> >> > Stefan Wienert
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://www.wienert.cc
>>>>>>>> >> >> > stefan@wienert.cc
>>>>>>>> >> >> >
>>>>>>>> >> >> > Telefon: +495251-2026838
>>>>>>>> >> >> > Mobil: +49176-40170270
>>>>>>>> >> >> >
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> --
>>>>>>>> >> >> Stefan Wienert
>>>>>>>> >> >>
>>>>>>>> >> >> http://www.wienert.cc
>>>>>>>> >> >> stefan@wienert.cc
>>>>>>>> >> >>
>>>>>>>> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> Mobil: +49176-40170270
>>>>>>>> >> >>
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Stefan Wienert
>>>>>>>> >>
>>>>>>>> >> http://www.wienert.cc
>>>>>>>> >> stefan@wienert.cc
>>>>>>>> >>
>>>>>>>> >> Telefon: +495251-2026838
>>>>>>>> >> Mobil: +49176-40170270
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> stefan@wienert.cc
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefan Wienert
>>>>>>
>>>>>> http://www.wienert.cc
>>>>>> stefan@wienert.cc
>>>>>>
>>>>>> Telefon: +495251-2026838
>>>>>> Mobil: +49176-40170270
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PPS: Another non-distributed implementation of this is redsvd (on
Google Code), which claims impressive benchmark speed, but it apparently
hits a memory ceiling at some point; I don't think I saw a test there
larger than the equivalent of a 10k x 10k dense matrix input.

On Tue, Jun 7, 2011 at 1:14 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> The tradeoff is of course the stochastic noise (Lanczos would be
> significantly more precise, esp. on random data, i.e. data without clear
> trends, but then why would you try to find trends in random
> data?), but you get to manage precision by increasing the p parameter at
> the expense of more memory and flops.
>
> (The Reuters example I ran computed k+p=200 singular values, of
> which LSI should probably use no more than k=50.)
>
> On Tue, Jun 7, 2011 at 1:08 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> You can check what i summed up in javadoc for SSVDSolver to get some idea.
>>
>> but on top what i have already mentioned, the main difference is that
>> it presumably doesn't require as much resources (reuters dataset runs
>> thru with default -Xmx200m per task setting just fine) and theoretical
>> flops are quite less than that of Lanczos which probably means it
>> should rip thru the same volume faster. (my practical run showed that
>> it takes about the same time as seq2sparse that produced its input,
>> and that was before sparse vector bug fix which caused significant
>> slow down on sparse data).
>>
>> I don't think you should 'modify' your program, especially if you need
>> to show practical result within a week. but you might help me by
>> creating a version of your program that takes it and we can see
>> together what it takes to get it working for you.
>>
>> -d
>>
>> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>> Before I rewrite my program, is there any advantage over the Lanczos SVD?
>>>
>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>> I am saying i did not test it with 0.20.2
>>>>
>>>> Yes it is integrated in 0.5 release but there might be problems with
>>>> hadoop 0.20.2
>>>>
>>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>> Hmm... looks nice...
>>>>>
>>>>> So there is a Lanczos implementation of SVD and the stochastic version
>>>>> of SVD. Both produce the doc-concept vectors that I need.
>>>>> I actually get my TFIDF vectors directly from a Lucene index (and have
>>>>> to do some magic to get IntWritable, VectorWritable).
>>>>>
>>>>> Still, I do not exactly understand what you are trying to say. Your SSVD
>>>>> does not run with mahout but on CDH (what is this btw?)? Or is it
>>>>> available for mahout? So what has to be modified to run it with
>>>>> mahout?
>>>>>
>>>>> Thanks
>>>>> Stefan
>>>>>
>>>>>
>>>>>
>>>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable of
>>>>>> producing exact U, V and Sigma,
>>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>>>>
>>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>>>>> necessarily DoubleWritable (you can still keep your original document
>>>>>> paths or whatever they are identified with), that info is migrated as
>>>>>> keys of the output of U matrix. So you can run it directly on the
>>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters
>>>>>> dataset example from Mahout In Action).
>>>>>>
>>>>>> The computation has a stochastic noise to it, but LSI problems are
>>>>>> never exact problems anyway and their results are subject to corpus
>>>>>> selection.
>>>>>>
>>>>>> If you are interested to help me to work kinks out in Mahout's
>>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on
>>>>>> CDH in a customized way. But be warned that it may require a number of
>>>>>> patches before it works for you.
>>>>>>
>>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> -D
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>>>>
>>>>>>> And now your questions:
>>>>>>>
>>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
>>>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>>>>>>> Document-Concept-Vectors. These Vectors will be compared with cosine
>>>>>>> similarity to find similar documents. I can also cluster these
>>>>>>> documents with k-means...
>>>>>>>
>>>>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>>>>> in other document-similarity-techniques.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> If I understand your question correctly, you need to simply transpose M
>>>>>>>> before you start the run and that way
>>>>>>>> you will get the other singular vectors.
>>>>>>>>
>>>>>>>> May I ask what is the problem you are working on and why do you need the
>>>>>>>> singular vectors?
>>>>>>>> Can you consider using another matrix decomposition technique for example
>>>>>>>> alternating least squares
>>>>>>>> which gives you two lower-rank matrices whose product approximates the large decomposed
>>>>>>>> matrix?
>>>>>>>>
>>>>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>>>
>>>>>>>>> Hi Danny!
>>>>>>>>>
>>>>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>>>>
>>>>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>>>>> can help me!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>>> > Hi Stefan!
>>>>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>>>>> > identical.
>>>>>>>>> > See the SVD Wikipedia text: When *M* is also positive
>>>>>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>>>>>> decomposition.
>>>>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>>>>> >
>>>>>>>>> > Hope this helps!
>>>>>>>>> >
>>>>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> >> Hi.
>>>>>>>>> >>
>>>>>>>>> >> Thanks for the help.
>>>>>>>>> >>
>>>>>>>>> >> The important points from wikipedia are:
>>>>>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>>>>> >>
>>>>>>>>> >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
>>>>>>>>> >> it does A=M*M', but it is not a problem). Therefore it already
>>>>>>>>> >> calculates the right (or left) singular vectors of M.
>>>>>>>>> >>
>>>>>>>>> >> But my question is, how can I get the other singular vector? I can
>>>>>>>>> >> transpose M, but then I have to calculate two SVDs, one for the right
>>>>>>>>> >> and one for the left singular vectors... I think there is a better way
>>>>>>>>> >> :)
>>>>>>>>> >>
>>>>>>>>> >> Hope you can help me with this...
>>>>>>>>> >> Thanks
>>>>>>>>> >> Stefan
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>>> >> > Hi
>>>>>>>>> >> > Mahout SVD implementation computes the Lanczos iteration:
>>>>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>>>>>>> >> > computed by A=M'*M
>>>>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>>>>> >> > computed such that A =~ V*T*V'
>>>>>>>>> >> > (this process is done in a distributed way).
>>>>>>>>> >> >
>>>>>>>>> >> > Next the matrix T is decomposed into eigenvectors and
>>>>>>>>> eigenvalues,
>>>>>>>>> >> > which is the returned result. (This process
>>>>>>>>> >> > is serial).
>>>>>>>>> >> >
>>>>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>>>>>>> other
>>>>>>>>> >> > (which is optional IMHO).
>>>>>>>>> >> >
>>>>>>>>> >> > The heart of the code is found at:
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>>>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>>>>>>>> are
>>>>>>>>> >> > changes in version 0.5
>>>>>>>>> >> >
>>>>>>>>> >> > Anyway, Mahout does not directly compute the SVD. If you are interested in
>>>>>>>>> >> > learning more about the relation to SVD
>>>>>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>>>>>> >> >
>>>>>>>>> >> > Hope this helps,
>>>>>>>>> >> >
>>>>>>>>> >> > Danny Bickson
>>>>>>>>> >> >
>>>>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>>>>>>>> >> wrote:
>>>>>>>>> >> >
>>>>>>>>> >> >> After reading this thread:
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >>
>>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>>>>>> >> >>
>>>>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>>>>> >> >>
>>>>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>>>>> >> >>
>>>>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Is V = M (U S)* (because this is what the calculation in the example
>>>>>>>>> >> is)?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Thanks
>>>>>>>>> >> >> Stefan
>>>>>>>>> >> >>
>>>>>>>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>>> >> >> >
>>>>>>>>> >>
>>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > What is done:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Input:
>>>>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > transpose svd:
>>>>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > matrix multiply:
>>>>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>>>>>>>> wikipedia.
>>>>>>>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>>>>>>>> the
>>>>>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>>>>>> eigenvectors
>>>>>>>>> >> >> premultiplied by the diagonal or the square root of the
>>>>>>>>> >> >> >> diagonal element.
>>>>>>>>> >> >> > from
>>>>>>>>> >> >>
>>>>>>>>> >>
>>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>>>>>>>> do I
>>>>>>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Thanks,
>>>>>>>>> >> >> > Stefan
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>>> >> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >>
>>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms), so the result of
>>>>>>>>> >> >> >> the matrix multiplication has to be the right singular vectors, judging
>>>>>>>>> by
>>>>>>>>> >> >> >> the dimensions.
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> so the result is the "concept-document vector matrix" (which I
>>>>>>>>> >> >> >> think is also called the "document vectors"?)
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>>>>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>>>>>> >> >> >>>
>>>>>>>>> >> >> >>> There is an additional step that can be run to produce document
>>>>>>>>> >> >> vectors.
>>>>>>>>> >> >> >>>
>>>>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>>>>>>>> >
>>>>>>>>> >> >> wrote:
>>>>>>>>> >> >> >>>
>>>>>>>>> >> >> >>>> compared to SVD, is the result the "right singular value"?
>>>>>>>>> >> >> >>>>
>>>>>>>>> >> >> >>>
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> --
>>>>>>>>> >> >> >> Stefan Wienert
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> http://www.wienert.cc
>>>>>>>>> >> >> >> stefan@wienert.cc
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > --
>>>>>>>>> >> >> > Stefan Wienert
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > http://www.wienert.cc
>>>>>>>>> >> >> > stefan@wienert.cc
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Telefon: +495251-2026838
>>>>>>>>> >> >> > Mobil: +49176-40170270
>>>>>>>>> >> >> >
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> --
>>>>>>>>> >> >> Stefan Wienert
>>>>>>>>> >> >>
>>>>>>>>> >> >> http://www.wienert.cc
>>>>>>>>> >> >> stefan@wienert.cc
>>>>>>>>> >> >>
>>>>>>>>> >> >> Telefon: +495251-2026838
>>>>>>>>> >> >> Mobil: +49176-40170270
>>>>>>>>> >> >>
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Stefan Wienert
>>>>>>>>> >>
>>>>>>>>> >> http://www.wienert.cc
>>>>>>>>> >> stefan@wienert.cc
>>>>>>>>> >>
>>>>>>>>> >> Telefon: +495251-2026838
>>>>>>>>> >> Mobil: +49176-40170270
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Stefan Wienert
>>>>>>>>>
>>>>>>>>> http://www.wienert.cc
>>>>>>>>> stefan@wienert.cc
>>>>>>>>>
>>>>>>>>> Telefon: +495251-2026838
>>>>>>>>> Mobil: +49176-40170270
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Wienert
>>>>>>>
>>>>>>> http://www.wienert.cc
>>>>>>> stefan@wienert.cc
>>>>>>>
>>>>>>> Telefon: +495251-2026838
>>>>>>> Mobil: +49176-40170270
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefan Wienert
>>>>>
>>>>> http://www.wienert.cc
>>>>> stefan@wienert.cc
>>>>>
>>>>> Telefon: +495251-2026838
>>>>> Mobil: +49176-40170270
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
The tradeoff is of course the stochastic noise (Lanczos would be
significantly more precise, esp. on random data, i.e. data without clear
trends, but then why would you try to find trends in random
data?), but you get to manage precision by increasing the p parameter at
the expense of more memory and flops.

(The Reuters example I ran computed k+p=200 singular values, of
which LSI should probably use no more than k=50.)
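The k+p oversampling being described can be sketched in a few lines. This is a toy, in-memory illustration of the randomized-projection idea (assuming NumPy), not the distributed Mahout SSVDSolver; the function name and parameters are illustrative:

```python
import numpy as np

def stochastic_svd(M, k, p=10, seed=0):
    """Approximate rank-k SVD via projection onto a random (k+p)-dim subspace."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((M.shape[1], k + p))   # random test matrix
    Q, _ = np.linalg.qr(M @ omega)                     # orthonormal basis for range(M @ omega)
    B = Q.T @ M                                        # small (k+p) x n projection of M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # cheap dense SVD of the small matrix
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]              # approximate U, Sigma, V'

# A rank-10 toy matrix: with k+p at or above the rank, the sampled subspace
# captures essentially the whole range of M and the top-k values come out
# (near-)exact.
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 100))
U, s, Vt = stochastic_svd(M, k=5, p=10)
exact = np.linalg.svd(M, compute_uv=False)[:5]
assert np.allclose(s, exact)
print("top-5 singular values match")
```

This is why increasing p buys precision: the extra columns widen the intermediate matrices (more memory and flops) but make the random subspace a better cover of the input's dominant directions.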

On Tue, Jun 7, 2011 at 1:08 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> You can check what i summed up in javadoc for SSVDSolver to get some idea.
>
> but on top what i have already mentioned, the main difference is that
> it presumably doesn't require as much resources (reuters dataset runs
> thru with default -Xmx200m per task setting just fine) and theoretical
> flops are quite less than that of Lanczos which probably means it
> should rip thru the same volume faster. (my practical run showed that
> it takes about the same time as seq2sparse that produced its input,
> and that was before sparse vector bug fix which caused significant
> slow down on sparse data).
>
> I don't think you should 'modify' your program, especially if you need
> to show practical result within a week. but you might help me by
> creating a version of your program that takes it and we can see
> together what it takes to get it working for you.
>
> -d
>
> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <st...@wienert.cc> wrote:
>> Before I rewrite my program, is there any advantage over the Lanczos SVD?
>>
>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>> I am saying i did not test it with 0.20.2
>>>
>>> Yes it is integrated in 0.5 release but there might be problems with
>>> hadoop 0.20.2
>>>
>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>> Hmm... looks nice...
>>>>
>>>> So there is a Lanczos implementation of SVD and the stochastic version
>>>> of SVD. Both produce the doc-concept vectors that I need.
>>>> I actually get my TFIDF vectors directly from a Lucene index (and have
>>>> to do some magic to get IntWritable, VectorWritable).
>>>>
>>>> Still, I do not exactly understand what you are trying to say. Your SSVD
>>>> does not run with mahout but on CDH (what is this btw?)? Or is it
>>>> available for mahout? So what has to be modified to run it with
>>>> mahout?
>>>>
>>>> Thanks
>>>> Stefan
>>>>
>>>>
>>>>
>>>> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable of
>>>>> producing exact U, V and Sigma,
>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>>>
>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>>>> necessarily DoubleWritable (you can still keep your original document
>>>>> paths or whatever they are identified with), that info is migrated as
>>>>> keys of the output of U matrix. So you can run it directly on the
>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters
>>>>> dataset example from Mahout In Action).
>>>>>
>>>>> The computation has a stochastic noise to it, but LSI problems are
>>>>> never exact problems anyway and their results are subject to corpus
>>>>> selection.
>>>>>
>>>>> If you are interested to help me to work kinks out in Mahout's
>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on
>>>>> CDH in a customized way. But be warned that it may require a number of
>>>>> patches before it works for you.
>>>>>
>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -D
>>>>>
>>>>>
>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>>>
>>>>>> And now your questions:
>>>>>>
>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
>>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>>>>>> Document-Concept-Vectors. These Vectors will be compared with cosine
>>>>>> similarity to find similar documents. I can also cluster these
>>>>>> documents with k-means...
>>>>>>
>>>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>>>> in other document-similarity-techniques.
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>> If I understand your question correctly, you need to simply transpose M
>>>>>>> before you start the run and that way
>>>>>>> you will get the other singular vectors.
>>>>>>>
>>>>>>> May I ask what is the problem you are working on and why do you need the
>>>>>>> singular vectors?
>>>>>>> Can you consider using another matrix decomposition technique for example
>>>>>>> alternating least squares
>>>>>>> which gives you two lower-rank matrices whose product approximates the large decomposed
>>>>>>> matrix?
>>>>>>>
>>>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>>>>
>>>>>>>> Hi Danny!
>>>>>>>>
>>>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>>>
>>>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>>>> can help me!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> > Hi Stefan!
>>>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>>>> > identical.
>>>>>>>> > See the SVD Wikipedia text: When *M* is also positive
>>>>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>>>>> decomposition.
>>>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>>>> >
>>>>>>>> > Hope this helps!
>>>>>>>> >
>>>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> >> Hi.
>>>>>>>> >>
>>>>>>>> >> Thanks for the help.
>>>>>>>> >>
>>>>>>>> >> The important points from wikipedia are:
>>>>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>>>> >>
>>>>>>>> >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
>>>>>>>> >> it does A=M*M', but it is not a problem). Therefore it already
>>>>>>>> >> calculates the right (or left) singular vectors of M.
>>>>>>>> >>
>>>>>>>> >> But my question is, how can I get the other singular vector? I can
>>>>>>>> >> transpose M, but then I have to calculate two SVDs, one for the right
>>>>>>>> >> and one for the left singular vectors... I think there is a better way
>>>>>>>> >> :)
>>>>>>>> >>
>>>>>>>> >> Hope you can help me with this...
>>>>>>>> >> Thanks
>>>>>>>> >> Stefan
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>>>>> >> > Hi
>>>>>>>> >> > Mahout SVD implementation computes the Lanczos iteration:
>>>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>>>>>> >> > computed by A=M'*M
>>>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>>>> >> > computed such that A =~ V*T*V'
>>>>>>>> >> > (this process is done in a distributed way).
>>>>>>>> >> >
>>>>>>>> >> > Next the matrix T is decomposed into eigenvectors and
>>>>>>>> eigenvalues,
>>>>>>>> >> > which is the returned result. (This process
>>>>>>>> >> > is serial).
>>>>>>>> >> >
>>>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>>>>>> other
>>>>>>>> >> > (which is optional IMHO).
>>>>>>>> >> >
>>>>>>>> >> > The heart of the code is found at:
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>>>>>>> are
>>>>>>>> >> > changes in version 0.5
>>>>>>>> >> >
>>>>>>>> >> > Anyway, Mahout does not directly compute the SVD. If you are interested in
>>>>>>>> >> > learning more about the relation to SVD
>>>>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>>>>> >> >
>>>>>>>> >> > Hope this helps,
>>>>>>>> >> >
>>>>>>>> >> > Danny Bickson
>>>>>>>> >> >
>>>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>>>>>>> >> wrote:
>>>>>>>> >> >
>>>>>>>> >> >> After reading this thread:
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>>>>> >> >>
>>>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>>>> >> >>
>>>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>>>> >> >>
>>>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>>>> >> >>
>>>>>>>> >> >> Is V = M (U S)* (because this is, what the calculation in the example
>>>>>>>> >> is)?
>>>>>>>> >> >>
>>>>>>>> >> >> Thanks
>>>>>>>> >> >> Stefan
>>>>>>>> >> >>
>>>>>>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>> >> >> >
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >
>>>>>>>> >> >> > What is done:
>>>>>>>> >> >> >
>>>>>>>> >> >> > Input:
>>>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose svd:
>>>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>>>> >> >> >
>>>>>>>> >> >> > matrix multiply:
>>>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>>>>>> >> >> >
>>>>>>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>>>>>>> wikipedia.
>>>>>>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>>>>>>> the
>>>>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>>>>> eigenvectors
>>>>>>>> >> >> premultiplied by the diagonal or the square root of the
>>>>>>>> >> >> >> diagonal element.
>>>>>>>> >> >> > from
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>>>>> >> >> >
>>>>>>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>>>>>>> do I
>>>>>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>>>>>> >> >> >
>>>>>>>> >> >> > Thanks,
>>>>>>>> >> >> > Stefan
>>>>>>>> >> >> >
>>>>>>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>>>>> >> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>>>>>>>> >> >> >> matrix multiplication has to be the right singular value regarding
>>>>>>>> to
>>>>>>>> >> >> >> the dimensions.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> so the result is the "concept-document vector matrix" (as I think,
>>>>>>>> >> >> >> these is also called "document vectors" ?)
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>>>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> There is an additional step that can be run to produce document
>>>>>>>> >> >> vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>>>>>>> >
>>>>>>>> >> >> wrote:
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>>> compared to SVD, is the result is the "right singular value"?
>>>>>>>> >> >> >>>>
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> --
>>>>>>>> >> >> >> Stefan Wienert
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> http://www.wienert.cc
>>>>>>>> >> >> >> stefan@wienert.cc
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>>>> >> >> >>
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> > --
>>>>>>>> >> >> > Stefan Wienert
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://www.wienert.cc
>>>>>>>> >> >> > stefan@wienert.cc
>>>>>>>> >> >> >
>>>>>>>> >> >> > Telefon: +495251-2026838
>>>>>>>> >> >> > Mobil: +49176-40170270
>>>>>>>> >> >> >
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> --
>>>>>>>> >> >> Stefan Wienert
>>>>>>>> >> >>
>>>>>>>> >> >> http://www.wienert.cc
>>>>>>>> >> >> stefan@wienert.cc
>>>>>>>> >> >>
>>>>>>>> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> Mobil: +49176-40170270
>>>>>>>> >> >>
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Stefan Wienert
>>>>>>>> >>
>>>>>>>> >> http://www.wienert.cc
>>>>>>>> >> stefan@wienert.cc
>>>>>>>> >>
>>>>>>>> >> Telefon: +495251-2026838
>>>>>>>> >> Mobil: +49176-40170270
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> stefan@wienert.cc
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefan Wienert
>>>>>>
>>>>>> http://www.wienert.cc
>>>>>> stefan@wienert.cc
>>>>>>
>>>>>> Telefon: +495251-2026838
>>>>>> Mobil: +49176-40170270
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
You can check what I summed up in the javadoc for SSVDSolver to get some idea.

But on top of what I have already mentioned, the main difference is that
it presumably doesn't require as many resources (the Reuters dataset runs
through with the default -Xmx200m per-task setting just fine), and its
theoretical flop count is quite a bit lower than that of Lanczos, which
probably means it should get through the same volume faster. (My practical
run showed it takes about the same time as the seq2sparse job that produced
its input, and that was before the sparse-vector bug fix that had caused a
significant slowdown on sparse data.)
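The core idea behind the stochastic approach can be illustrated with a toy, pure-Python sketch of the randomized projection step: project M onto a small random subspace, orthonormalize, and decompose the much smaller projected matrix. This is only a rank-1 illustration with made-up variable names, not Mahout's actual SSVD code:

```python
import math
import random

random.seed(42)

# A rank-1 "corpus" matrix M = sigma * u * v^T  (3 docs x 2 terms).
u = [0.6, 0.8, 0.0]                       # left singular vector (unit norm)
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # right singular vector (unit norm)
sigma = 10.0
M = [[sigma * ui * vj for vj in v] for ui in u]

# 1) Random projection: y = M * omega captures the column space of M.
omega = [random.gauss(0, 1) for _ in range(2)]
y = [sum(M[i][j] * omega[j] for j in range(2)) for i in range(3)]

# 2) Orthonormalize (a single column here, so just normalize).
norm_y = math.sqrt(sum(t * t for t in y))
q = [t / norm_y for t in y]               # estimate of u, up to sign

# 3) Small decomposition: b = q^T * M recovers sigma and v.
b = [sum(q[i] * M[i][j] for i in range(3)) for j in range(2)]
sigma_est = math.sqrt(sum(t * t for t in b))
v_est = [t / sigma_est for t in b]        # estimate of v, up to sign
```

For a rank-1 matrix the random projection recovers the range exactly; for higher ranks one projects onto k + p random columns and orthonormalizes with QR, which is where the flop savings over Lanczos come from.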

I don't think you should 'modify' your program, especially if you need
to show a practical result within a week. But you might help me by
creating a version of your program that uses it, and we can see
together what it takes to get it working for you.

-d
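For comparison, the Lanczos route discussed earlier in the thread rests on the identity that the eigenvectors of M'M are the right singular vectors V of M, with the left ones following as U = M V / sigma. A toy check on a tiny matrix, with power iteration standing in for Lanczos (illustrative only, not Mahout code):

```python
import math

M = [[2.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]   # 3 "docs" x 2 "terms", singular values 2 and 1

# A = M^T M  (2 x 2 symmetric)
A = [[sum(M[k][i] * M[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]

# Power iteration on A: converges to the dominant eigenvector of M^T M,
# which is the top right singular vector of M.
v = [1.0, 1.0]
for _ in range(50):
    w = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
    n = math.sqrt(sum(t * t for t in w))
    v = [t / n for t in w]

# sigma^2 = v^T A v, and the matching left singular vector is M v / sigma.
sigma = math.sqrt(sum(v[i] * A[i][j] * v[j]
                      for i in range(2) for j in range(2)))
u = [sum(M[i][j] * v[j] for j in range(2)) / sigma for i in range(3)]
```

This is exactly why transposing M before the run swaps which set of singular vectors the eigensolver hands back.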

On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Before I rewrite my program, is there any advantage over the Lanczos SVD?

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Before I rewrite my program, is there any advantage over the Lanczos SVD?

2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
> I am saying I did not test it with 0.20.2.
>
> Yes, it is integrated in the 0.5 release, but there might be problems
> with Hadoop 0.20.2.




Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am saying I did not test it with 0.20.2.

Yes, it is integrated in the 0.5 release, but there might be problems with
Hadoop 0.20.2.

On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Hmm... looks nice...
>
> So there is a Lanczos implementation of SVD and the stochastic version
> of SVD. Both produce the doc-concept vectors that I need.
> I acctualy get my TFIDF Vectors directly from a lucene index (and have
> to do some magic to get IntWritable, VectorWritable).
>
> Still, I do not exactly understand what you trying to say. Your SSVD
> does not run with mahout but on CDH (what is this btw?)? Or is it
> available for mahout? So what has to be modified to run it with
> mahout?
>
> Thanks
> Stefan
>
>
>
> 2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
>> I also do LSI/LSA and wrote a stochastic svd that is capable to
>> produce exact U, V and Sigma ,
>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>
>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>> necessarily DoubleWritable (you can still keep your original document
>> paths or whatever they are identified with), that info is migrated as
>> keys of the output of U matrix. So you can run it directly on the
>> output of the Mahout's seq2sparse (I actually tested that on reuters
>> dataset example from Mahout In Action).
>>
>> The computation has a stochastic noise to it, but LSI problems are
>> never exact problems anyway and their result are subject to corpus
>> selection.
>>
>> If you are interested to help me to work kinks out in Mahout's
>> version, I'd be grateful since i don't run the method on 0.20.2 but on
>> CDH in a customized way. But be warned that it may require a number of
>> patches before it works for you.
>>
>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>
>> Thanks.
>>
>> -D
>>
>>
>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>> Yeha :) That's what I assumed. So this problem is solved. Thanks!
>>>
>>> And now your questions:
>>>
>>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF
>>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>>> Document-Concept-Vectors. These Vectors will be compared with cosine
>>> similarity to find similar documents. I can also cluster these
>>> documents with k-means...
>>>
>>> If you have any suggestions, feel free to tell me. I'm also interested
>>> in other document-similarity-techniques.
>>>
>>> Cheers,
>>> Stefan
>>>
>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>> If I understand your question correctly, you need to simply transpose M
>>>> before you start the run and that way
>>>> you will get the other singular vectors.
>>>>
>>>> May I ask what is the problem you are working on and why do you need the
>>>> singular vectors?
>>>> Can you consider using another matrix decomposition technique for example
>>>> alternating least squares
>>>> which gives you two lower rank matrices which simulates the large decomposed
>>>> matrix?
>>>>
>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>>
>>>>> Hi Danny!
>>>>>
>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>> are the left singular vectors of M. But I need the right singular
>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>
>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>> can help me!
>>>>>
>>>>> Thanks,
>>>>> Stefan
>>>>>
>>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>> > Hi Stefan!
>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>> > identical.
>>>>> > See SVD wikipedia text: When *M* is also positive
>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>> decomposition.
>>>>> > So you don't need to be worried about the other singular vectors.
>>>>> >
>>>>> > Hope this helps!
>>>>> >
>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>>>> wrote:
>>>>> >
>>>>> >> Hi.
>>>>> >>
>>>>> >> Thanks for the help.
>>>>> >>
>>>>> >> The important points from wikipedia are:
>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>> >>
>>>>> >> as you describe, the mahout lanczos solver calculate A=M'*M (I think
>>>>> >> it does A=M*M', but it is not a problem). Therefore it does already
>>>>> >> calculate the right (or left) singular vector of M.
>>>>> >>
>>>>> >> But my question is, how can I get the other singular vector? I can
>>>>> >> transpose M, but then I have to calculated two SVDs, one for the right
>>>>> >> and one for the left singular value... I think there is a better way
>>>>> >> :)
>>>>> >>
>>>>> >> Hope you can help me with this...
>>>>> >> Thanks
>>>>> >> Stefan
>>>>> >>
>>>>> >>
>>>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>>> >> > Hi
>>>>> >> > Mahout SVD implementation computes the Lanczos iteration:
>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>>> >> > computed by A=M'*M
>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>> >> > computed such that A =~ V*T*V'
>>>>> >> > (this process is done in a distributed way).
>>>>> >> >
>>>>> >> > Next the matrix T is next decomposed into eigenvectors and
>>>>> eignevalues.
>>>>> >> > Which is the returned result. (This process
>>>>> >> > is serial).
>>>>> >> >
>>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>>> other
>>>>> >> > (which is optional IMHO).
>>>>> >> >
>>>>> >> > The heart of the code is found at:
>>>>> >> >
>>>>> >>
>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>>>> are
>>>>> >> > changes in version 0.5
>>>>> >> >
>>>>> >> > Anyway, Mahout does not compute directly SVD. If you are interested in
>>>>> >> > learning more about the relation to SVD
>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>> >> >
>>>>> >> > Hope this helps,
>>>>> >> >
>>>>> >> > Danny Bickson
>>>>> >> >
>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>>>> >> wrote:
>>>>> >> >
>>>>> >> >> After reading this thread:
>>>>> >> >>
>>>>> >> >>
>>>>> >>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>> >> >>
>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>> >> >>
>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>> >> >>
>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>> >> >>
>>>>> >> >> Is V = M (U S)* (because this is, what the calculation in the example
>>>>> >> is)?
>>>>> >> >>
>>>>> >> >> Thanks
>>>>> >> >> Stefan
>>>>> >> >>
>>>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>> >> >> >
>>>>> >>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>> >> >> >
>>>>> >> >> > What is done:
>>>>> >> >> >
>>>>> >> >> > Input:
>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>> >> >> >
>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>> >> >> >
>>>>> >> >> > transpose tf-idf-matrix:
>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>> >> >> >
>>>>> >> >> > transpose svd:
>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>> >> >> >
>>>>> >> >> > matrix multiply:
>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>>> >> >> >
>>>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>>>> wikipedia.
>>>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>>>> the
>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>> eigenvectors
>>>>> >> >> premultiplied by the diagonal or the square root of the
>>>>> >> >> >> diagonal element.
>>>>> >> >> > from
>>>>> >> >>
>>>>> >>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>> >> >> >
>>>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>>>> do I
>>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>>> >> >> >
>>>>> >> >> > Thanks,
>>>>> >> >> > Stefan
>>>>> >> >> >
>>>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>>> >> >> >>
>>>>> >> >>
>>>>> >>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>> >> >> >>
>>>>> >> >> >> the last step is the matrix multiplication:
>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>> >> >> >>
>>>>> >> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>>>>> >> >> >> matrix multiplication has to be the right singular value regarding
>>>>> to
>>>>> >> >> >> the dimensions.
>>>>> >> >> >>
>>>>> >> >> >> so the result is the "concept-document vector matrix" (as I think,
>>>>> >> >> >> these is also called "document vectors" ?)
>>>>> >> >> >>
>>>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>> >> >> >>>
>>>>> >> >> >>> There is an additional step that can be run to produce document
>>>>> >> >> vectors.
>>>>> >> >> >>>
>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>>>> >
>>>>> >> >> wrote:
>>>>> >> >> >>>
>>>>> >> >> >>>> compared to SVD, is the result is the "right singular value"?
>>>>> >> >> >>>>
>>>>> >> >> >>>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >> --
>>>>> >> >> >> Stefan Wienert
>>>>> >> >> >>
>>>>> >> >> >> http://www.wienert.cc
>>>>> >> >> >> stefan@wienert.cc
>>>>> >> >> >>
>>>>> >> >> >> Telefon: +495251-2026838
>>>>> >> >> >> Mobil: +49176-40170270
>>>>> >> >> >>
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > --
>>>>> >> >> > Stefan Wienert
>>>>> >> >> >
>>>>> >> >> > http://www.wienert.cc
>>>>> >> >> > stefan@wienert.cc
>>>>> >> >> >
>>>>> >> >> > Telefon: +495251-2026838
>>>>> >> >> > Mobil: +49176-40170270
>>>>> >> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Stefan Wienert
>>>>> >> >>
>>>>> >> >> http://www.wienert.cc
>>>>> >> >> stefan@wienert.cc
>>>>> >> >>
>>>>> >> >> Telefon: +495251-2026838
>>>>> >> >> Mobil: +49176-40170270
>>>>> >> >>
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Stefan Wienert
>>>>> >>
>>>>> >> http://www.wienert.cc
>>>>> >> stefan@wienert.cc
>>>>> >>
>>>>> >> Telefon: +495251-2026838
>>>>> >> Mobil: +49176-40170270
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefan Wienert
>>>>>
>>>>> http://www.wienert.cc
>>>>> stefan@wienert.cc
>>>>>
>>>>> Telefon: +495251-2026838
>>>>> Mobil: +49176-40170270
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Hmm... looks nice...

So there is a Lanczos implementation of SVD and a stochastic version
of SVD. Both produce the doc-concept vectors that I need.
I actually get my TF-IDF vectors directly from a Lucene index (and have
to do some magic to get IntWritable, VectorWritable).

Still, I do not exactly understand what you are trying to say. Your SSVD
does not run with Mahout but on CDH (what is this, btw?)? Or is it
available for Mahout? So what has to be modified to run it with
Mahout?

Thanks
Stefan
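[Editorial aside: the pipeline discussed in this thread (TF-IDF matrix → rank-k SVD → document-concept vectors → cosine similarity) can be sanity-checked outside Hadoop with a few lines of NumPy. The toy matrix below is illustrative only, not Mahout output; the sketch also verifies the identity V = M' U S^-1 raised earlier in the thread.]

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((8, 12))  # toy docs x terms TF-IDF matrix (illustrative only)

k = 3  # number of "concepts" to keep
U, s, Vt = np.linalg.svd(M, full_matrices=False)
doc_concept = U[:, :k] * s[:k]  # docs x concepts, i.e. U_k * Sigma_k

# Recovering the right singular vectors from M and (U_k, Sigma_k):
# M = U S V'  =>  V_k = M' U_k S_k^{-1}
V = M.T @ U[:, :k] / s[:k]
assert np.allclose(V, Vt[:k].T)

# Cosine similarity between documents in concept space
unit = doc_concept / np.linalg.norm(doc_concept, axis=1, keepdims=True)
sims = unit @ unit.T
print(sims.shape)  # (8, 8)
```

The rows of `sims` can then be used directly for nearest-neighbor lookup, or `doc_concept` fed to k-means, mirroring the Mahout workflow described above.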



2011/6/7 Dmitriy Lyubimov <dl...@gmail.com>:
> I also do LSI/LSA and wrote a stochastic svd that is capable to
> produce exact U, V and Sigma ,
> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>
> On top of that, it is adjusted to be LSI-friendly if your keys are not
> necessarily DoubleWritable (you can still keep your original document
> paths or whatever they are identified with), that info is migrated as
> keys of the output of U matrix. So you can run it directly on the
> output of the Mahout's seq2sparse (I actually tested that on reuters
> dataset example from Mahout In Action).
>
> The computation has a stochastic noise to it, but LSI problems are
> never exact problems anyway and their result are subject to corpus
> selection.
>
> If you are interested to help me to work kinks out in Mahout's
> version, I'd be grateful since i don't run the method on 0.20.2 but on
> CDH in a customized way. But be warned that it may require a number of
> patches before it works for you.
>
> Here is (a little bit too wordy) command line manual for Mahout 0.5.
> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>
> Thanks.
>
> -D
>
>
> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
>> Yeha :) That's what I assumed. So this problem is solved. Thanks!
>>
>> And now your questions:
>>
>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF
>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>> Document-Concept-Vectors. These Vectors will be compared with cosine
>> similarity to find similar documents. I can also cluster these
>> documents with k-means...
>>
>> If you have any suggestions, feel free to tell me. I'm also interested
>> in other document-similarity-techniques.
>>
>> Cheers,
>> Stefan
>>
>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>> If I understand your question correctly, you need to simply transpose M
>>> before you start the run and that way
>>> you will get the other singular vectors.
>>>
>>> May I ask what is the problem you are working on and why do you need the
>>> singular vectors?
>>> Can you consider using another matrix decomposition technique for example
>>> alternating least squares
>>> which gives you two lower rank matrices which simulates the large decomposed
>>> matrix?
>>>
>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>
>>>> Hi Danny!
>>>>
>>>> I understand that for M*M' (and for M'*M) the left and right
>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>> are the left singular vectors of M. But I need the right singular
>>>> vectors of M (and not M*M'). How do I get them?
>>>>
>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>> can help me!
>>>>
>>>> Thanks,
>>>> Stefan
>>>>
>>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>> > Hi Stefan!
>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>> > identical.
>>>> > See SVD wikipedia text: When *M* is also positive
>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>> decomposition.
>>>> > So you don't need to be worried about the other singular vectors.
>>>> >
>>>> > Hope this helps!
>>>> >
>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>>> wrote:
>>>> >
>>>> >> Hi.
>>>> >>
>>>> >> Thanks for the help.
>>>> >>
>>>> >> The important points from wikipedia are:
>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>> >>
>>>> >> as you describe, the mahout lanczos solver calculate A=M'*M (I think
>>>> >> it does A=M*M', but it is not a problem). Therefore it does already
>>>> >> calculate the right (or left) singular vector of M.
>>>> >>
>>>> >> But my question is, how can I get the other singular vector? I can
>>>> >> transpose M, but then I have to calculated two SVDs, one for the right
>>>> >> and one for the left singular value... I think there is a better way
>>>> >> :)
>>>> >>
>>>> >> Hope you can help me with this...
>>>> >> Thanks
>>>> >> Stefan
>>>> >>
>>>> >>
>>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>>> >> > Hi
>>>> >> > Mahout SVD implementation computes the Lanczos iteration:
>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>> >> > computed by A=M'*M
>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>> >> > computed such that A =~ V*T*V'
>>>> >> > (this process is done in a distributed way).
>>>> >> >
>>>> >> > Next the matrix T is next decomposed into eigenvectors and
>>>> eignevalues.
>>>> >> > Which is the returned result. (This process
>>>> >> > is serial).
>>>> >> >
>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>> other
>>>> >> > (which is optional IMHO).
>>>> >> >
>>>> >> > The heart of the code is found at:
>>>> >> >
>>>> >>
>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>>> are
>>>> >> > changes in version 0.5
>>>> >> >
>>>> >> > Anyway, Mahout does not compute directly SVD. If you are interested in
>>>> >> > learning more about the relation to SVD
>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>> >> > subsection: relation to eigenvalue decomposition.
>>>> >> >
>>>> >> > Hope this helps,
>>>> >> >
>>>> >> > Danny Bickson
>>>> >> >
>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>>> >> wrote:
>>>> >> >
>>>> >> >> After reading this thread:
>>>> >> >>
>>>> >> >>
>>>> >>
>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>> >> >>
>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>> >> >>
>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>> >> >>
>>>> >> >> So... How do I get V from (U S)  and M?
>>>> >> >>
>>>> >> >> Is V = M (U S)* (because this is, what the calculation in the example
>>>> >> is)?
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >> Stefan
>>>> >> >>
>>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>> >> >> >
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>> >> >> >
>>>> >> >> > What is done:
>>>> >> >> >
>>>> >> >> > Input:
>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>> >> >> >
>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>> >> >> >
>>>> >> >> > transpose tf-idf-matrix:
>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>> >> >> >
>>>> >> >> > transpose svd:
>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>> >> >> >
>>>> >> >> > matrix multiply:
>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>> >> >> >
>>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>>> wikipedia.
>>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>>> the
>>>> >> >> >> Instead either the left or right (but usually the right)
>>>> eigenvectors
>>>> >> >> premultiplied by the diagonal or the square root of the
>>>> >> >> >> diagonal element.
>>>> >> >> > from
>>>> >> >>
>>>> >>
>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>> >> >> >
>>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>>> do I
>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>> >> >> >
>>>> >> >> > Thanks,
>>>> >> >> > Stefan
>>>> >> >> >
>>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>>> >> >> >>
>>>> >> >>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>> >> >> >>
>>>> >> >> >> the last step is the matrix multiplication:
>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>> >> >> >>
>>>> >> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>>>> >> >> >> matrix multiplication has to be the right singular value regarding
>>>> to
>>>> >> >> >> the dimensions.
>>>> >> >> >>
>>>> >> >> >> so the result is the "concept-document vector matrix" (as I think,
>>>> >> >> >> these is also called "document vectors" ?)
>>>> >> >> >>
>>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>> >> >> >>>
>>>> >> >> >>> There is an additional step that can be run to produce document
>>>> >> >> vectors.
>>>> >> >> >>>
>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>>> >
>>>> >> >> wrote:
>>>> >> >> >>>
>>>> >> >> >>>> compared to SVD, is the result is the "right singular value"?
>>>> >> >> >>>>
>>>> >> >> >>>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> --
>>>> >> >> >> Stefan Wienert
>>>> >> >> >>
>>>> >> >> >> http://www.wienert.cc
>>>> >> >> >> stefan@wienert.cc
>>>> >> >> >>
>>>> >> >> >> Telefon: +495251-2026838
>>>> >> >> >> Mobil: +49176-40170270
>>>> >> >> >>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Stefan Wienert
>>>> >> >> >
>>>> >> >> > http://www.wienert.cc
>>>> >> >> > stefan@wienert.cc
>>>> >> >> >
>>>> >> >> > Telefon: +495251-2026838
>>>> >> >> > Mobil: +49176-40170270
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Stefan Wienert
>>>> >> >>
>>>> >> >> http://www.wienert.cc
>>>> >> >> stefan@wienert.cc
>>>> >> >>
>>>> >> >> Telefon: +495251-2026838
>>>> >> >> Mobil: +49176-40170270
>>>> >> >>
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Stefan Wienert
>>>> >>
>>>> >> http://www.wienert.cc
>>>> >> stefan@wienert.cc
>>>> >>
>>>> >> Telefon: +495251-2026838
>>>> >> Mobil: +49176-40170270
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I also do LSI/LSA and wrote a stochastic SVD that is capable of
producing exact U, V, and Sigma,
or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

On top of that, it is adjusted to be LSI-friendly if your keys are not
necessarily DoubleWritable (you can still keep your original document
paths or whatever they are identified with); that info is migrated into
the keys of the U matrix output. So you can run it directly on the
output of Mahout's seq2sparse (I actually tested that on the Reuters
dataset example from Mahout In Action).

The computation has some stochastic noise to it, but LSI problems are
never exact problems anyway, and their results are subject to corpus
selection.

If you are interested in helping me work the kinks out in Mahout's
version, I'd be grateful, since I don't run the method on 0.20.2 but on
CDH in a customized way. But be warned that it may require a number of
patches before it works for you.

Here is a (little bit too wordy) command-line manual for Mahout 0.5:
http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

Thanks.

-D
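[Editorial aside: for intuition, the core idea behind a stochastic SVD can be sketched in a few lines of NumPy. This is a single-machine illustration in the spirit of randomized range-projection methods, not the distributed Mahout SSVD implementation; the function name and parameters below are made up for the example.]

```python
import numpy as np

def randomized_svd(M, k, p=10, seed=0):
    """Rank-k SVD via a random range projection (illustrative sketch only;
    the distributed SSVD adds blocking, power iterations, etc.)."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    Omega = rng.standard_normal((n, k + p))  # random test matrix, p = oversampling
    Q, _ = np.linalg.qr(M @ Omega)           # orthonormal basis for the sampled range
    B = Q.T @ M                              # small (k+p) x n projection of M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]      # U, Sigma, V'

rng = np.random.default_rng(1)
A = rng.random((200, 5)) @ rng.random((5, 50))  # exactly rank-5 test matrix
U, s, Vt = randomized_svd(A, k=5)
print(np.allclose(A, (U * s) @ Vt))  # reconstruction is (near-)exact at full rank
```

Note how both U and V' come out of one pass, which is the property being discussed above: no second decomposition of the transpose is needed.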


On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:
> Yeha :) That's what I assumed. So this problem is solved. Thanks!
>
> And now your questions:
>
> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF
> Document-Term-Matrix and reduce it using Lanczos. So I get the
> Document-Concept-Vectors. These Vectors will be compared with cosine
> similarity to find similar documents. I can also cluster these
> documents with k-means...
>
> If you have any suggestions, feel free to tell me. I'm also interested
> in other document-similarity-techniques.
>
> Cheers,
> Stefan
>
> 2011/6/6 Danny Bickson <da...@gmail.com>:
>> If I understand your question correctly, you need to simply transpose M
>> before you start the run and that way
>> you will get the other singular vectors.
>>
>> May I ask what is the problem you are working on and why do you need the
>> singular vectors?
>> Can you consider using another matrix decomposition technique for example
>> alternating least squares
>> which gives you two lower rank matrices which simulates the large decomposed
>> matrix?
>>
>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>
>>> Hi Danny!
>>>
>>> I understand that for M*M' (and for M'*M) the left and right
>>> eigenvectors are identical. But that is not exactly what I want. The
>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>> are the left singular vectors of M. But I need the right singular
>>> vectors of M (and not M*M'). How do I get them?
>>>
>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>> can help me!
>>>
>>> Thanks,
>>> Stefan
>>>
>>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>> > Hi Stefan!
>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>> > identical.
>>> > See SVD wikipedia text: When *M* is also positive
>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>> decomposition.
>>> > So you don't need to be worried about the other singular vectors.
>>> >
>>> > Hope this helps!
>>> >
>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>>> wrote:
>>> >
>>> >> Hi.
>>> >>
>>> >> Thanks for the help.
>>> >>
>>> >> The important points from wikipedia are:
>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>> >>
>>> >> as you describe, the mahout lanczos solver calculate A=M'*M (I think
>>> >> it does A=M*M', but it is not a problem). Therefore it does already
>>> >> calculate the right (or left) singular vector of M.
>>> >>
>>> >> But my question is, how can I get the other singular vector? I can
>>> >> transpose M, but then I have to calculated two SVDs, one for the right
>>> >> and one for the left singular value... I think there is a better way
>>> >> :)
>>> >>
>>> >> Hope you can help me with this...
>>> >> Thanks
>>> >> Stefan
>>> >>
>>> >>
>>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>>> >> > Hi
>>> >> > Mahout SVD implementation computes the Lanczos iteration:
>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>> >> > computed by A=M'*M
>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>> >> > computed such that A =~ V*T*V'
>>> >> > (this process is done in a distributed way).
>>> >> >
>>> >> > Next the matrix T is next decomposed into eigenvectors and
>>> eignevalues.
>>> >> > Which is the returned result. (This process
>>> >> > is serial).
>>> >> >
>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>> other
>>> >> > (which is optional IMHO).
>>> >> >
>>> >> > The heart of the code is found at:
>>> >> >
>>> >>
>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>> >> > At least that is where it was in version 0.4 I am not sure if there
>>> are
>>> >> > changes in version 0.5
>>> >> >
>>> >> > Anyway, Mahout does not compute directly SVD. If you are interested in
>>> >> > learning more about the relation to SVD
>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>> >> > subsection: relation to eigenvalue decomposition.
>>> >> >
>>> >> > Hope this helps,
>>> >> >
>>> >> > Danny Bickson
>>> >> >
>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>>> >> wrote:
>>> >> >
>>> >> >> After reading this thread:
>>> >> >>
>>> >> >>
>>> >>
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>> >> >>
>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>> >> >>
>>> >> >> The output of Mahout-SVD is (U S) right?
>>> >> >>
>>> >> >> So... How do I get V from (U S)  and M?
>>> >> >>
>>> >> >> Is V = M (U S)* (because this is, what the calculation in the example
>>> >> is)?
>>> >> >>
>>> >> >> Thanks
>>> >> >> Stefan
>>> >> >>
>>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>> >> >> >
>>> >>
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>> >> >> >
>>> >> >> > What is done:
>>> >> >> >
>>> >> >> > Input:
>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>> >> >> >
>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>> >> >> > svd (concepts x terms) 87 x 20444
>>> >> >> >
>>> >> >> > transpose tf-idf-matrix:
>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>> >> >> >
>>> >> >> > transpose svd:
>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>> >> >> >
>>> >> >> > matrix multiply:
>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>> >> >> >
>>> >> >> > so... I do understand, that the "svd" here is not SVD from
>>> wikipedia.
>>> >> >> > It only does the Lanczos algorithm and some magic which produces
>>> the
>>> >> >> >> Instead either the left or right (but usually the right)
>>> eigenvectors
>>> >> >> premultiplied by the diagonal or the square root of the
>>> >> >> >> diagonal element.
>>> >> >> > from
>>> >> >>
>>> >>
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>> >> >> >
>>> >> >> > so my question: what is the output of the SVD in mahout. And what
>>> do I
>>> >> >> > have to calculate to get the "right singular value" from svd?
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> > Stefan
>>> >> >> >
>>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>>> >> >> >>
>>> >> >>
>>> >>
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>> >> >> >>
>>> >> >> >> the last step is the matrix multiplication:
>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>> >> >> >>  --arg --numColsB --arg 87 \
>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>> >> >> >>
>>> >> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>>> >> >> >> matrix multiplication has to be the right singular value regarding
>>> to
>>> >> >> >> the dimensions.
>>> >> >> >>
>>> >> >> >> so the result is the "concept-document vector matrix" (as I think,
>>> >> >> >> these is also called "document vectors" ?)
>>> >> >> >>
>>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>> >> >> >>>
>>> >> >> >>> There is an additional step that can be run to produce document
>>> >> >> vectors.
>>> >> >> >>>
>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>>> >
>>> >> >> wrote:
>>> >> >> >>>
>>> >> >> >>>> compared to SVD, is the result is the "right singular value"?
>>> >> >> >>>>
>>> >> >> >>>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> --
>>> >> >> >> Stefan Wienert
>>> >> >> >>
>>> >> >> >> http://www.wienert.cc
>>> >> >> >> stefan@wienert.cc
>>> >> >> >>
>>> >> >> >> Telefon: +495251-2026838
>>> >> >> >> Mobil: +49176-40170270
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Stefan Wienert
>>> >> >> >
>>> >> >> > http://www.wienert.cc
>>> >> >> > stefan@wienert.cc
>>> >> >> >
>>> >> >> > Telefon: +495251-2026838
>>> >> >> > Mobil: +49176-40170270
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Stefan Wienert
>>> >> >>
>>> >> >> http://www.wienert.cc
>>> >> >> stefan@wienert.cc
>>> >> >>
>>> >> >> Telefon: +495251-2026838
>>> >> >> Mobil: +49176-40170270
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Stefan Wienert
>>> >>
>>> >> http://www.wienert.cc
>>> >> stefan@wienert.cc
>>> >>
>>> >> Telefon: +495251-2026838
>>> >> Mobil: +49176-40170270
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Ted Dunning <te...@gmail.com>.
One thing to watch out for in clustering SVD stuff is eigenspokes.

www.cs.cmu.edu/~badityap/papers/eigenspokes-pakdd10.pdf

On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <st...@wienert.cc> wrote:

> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>
> And now your questions:
>
> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
> Document-Term-Matrix and reduce it using Lanczos. So I get the
> Document-Concept-Vectors. These Vectors will be compared with cosine
> similarity to find similar documents. I can also cluster these
> documents with k-means...
>
> If you have any suggestions, feel free to tell me. I'm also interested
> in other document-similarity-techniques.
>
> Cheers,
> Stefan
>
> 2011/6/6 Danny Bickson <da...@gmail.com>:
> > If I understand your question correctly, you need to simply transpose M
> > before you start the run and that way
> > you will get the other singular vectors.
> >
> > May I ask what is the problem you are working on and why do you need the
> > singular vectors?
> > Can you consider using another matrix decomposition technique for example
> > alternating least squares
> > which gives you two lower-rank matrices that approximate the large
> decomposed
> > matrix?
> >
> > On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc>
> wrote:
> >
> >> Hi Danny!
> >>
> >> I understand that for M*M' (and for M'*M) the left and right
> >> eigenvectors are identical. But that is not exactly what I want. The
> >> lanczos solver from mahout gives me the eigenvectors of M*M', which
> >> are the left singular vectors of M. But I need the right singular
> >> vectors of M (and not M*M'). How do I get them?
> >>
> >> Sorry, my matrix math is not as good as it should be, but I hope you
> >> can help me!
> >>
> >> Thanks,
> >> Stefan
> >>
> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
> >> > Hi Stefan!
> >> > For a positive semidefinite matrix, the left and right eigenvectors
> are
> >> > identical.
> >> > See SVD wikipedia text: When *M* is also positive
> >> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
> >> > the decomposition *M* = *U**D**U* * is also a singular value
> >> decomposition.
> >> > So you don't need to be worried about the other singular vectors.
> >> >
> >> > Hope this helps!
> >> >
> >> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
> >> wrote:
> >> >
> >> >> Hi.
> >> >>
> >> >> Thanks for the help.
> >> >>
> >> >> The important points from wikipedia are:
> >> >> - The left singular vectors of M are eigenvectors of M*M' .
> >> >> - The right singular vectors of M are eigenvectors of M'*M.
> >> >>
> >> >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
> >> >> it does A=M*M', but it is not a problem). Therefore it does already
> >> >> calculate the right (or left) singular vector of M.
> >> >>
> >> >> But my question is, how can I get the other singular vector? I can
> >> >> transpose M, but then I have to calculate two SVDs, one for the
> right
> >> >> and one for the left singular value... I think there is a better way
> >> >> :)
> >> >>
> >> >> Hope you can help me with this...
> >> >> Thanks
> >> >> Stefan
> >> >>
> >> >>
> >> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
> >> >> > Hi
> >> >> > Mahout SVD implementation computes the Lanczos iteration:
> >> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
> >> >> > Denote the non-square input matrix as M. First a symmetric matrix A
> is
> >> >> > computed by A=M'*M
> >> >> > Then an approximating tridiagonal matrix T and a vector matrix V
> are
> >> >> > computed such that A =~ V*T*V'
> >> >> > (this process is done in a distributed way).
> >> >> >
> >> >> > Next, the matrix T is decomposed into eigenvectors and
> >> eigenvalues.
> >> >> > Which is the returned result. (This process
> >> >> > is serial).
> >> >> >
> >> >> > The third step makes the returned eigenvectors orthogonal to each
> >> other
> >> >> > (which is optional IMHO).
> >> >> >
> >> >> > The heart of the code is found at:
> >> >> >
> >> >>
> >>
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> >> >> > At least that is where it was in version 0.4 I am not sure if there
> >> are
> >> >> > changes in version 0.5
> >> >> >
> >> >> > Anyway, Mahout does not compute directly SVD. If you are interested
> in
> >> >> > learning more about the relation to SVD
> >> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition
> ,
> >> >> > subsection: relation to eigenvalue decomposition.
> >> >> >
> >> >> > Hope this helps,
> >> >> >
> >> >> > Danny Bickson
> >> >> >
> >> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
> >> >> wrote:
> >> >> >
> >> >> >> After reading this thread:
> >> >> >>
> >> >> >>
> >> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
> >> >> >>
> >> >> >> Wiki-SVD: M = U S V* (* = transposed)
> >> >> >>
> >> >> >> The output of Mahout-SVD is (U S) right?
> >> >> >>
> >> >> >> So... How do I get V from (U S)  and M?
> >> >> >>
> >> >> >> Is V = M (U S)* (because this is what the calculation in the
> example
> >> >> is)?
> >> >> >>
> >> >> >> Thanks
> >> >> >> Stefan
> >> >> >>
> >> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >> >> >
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >> >
> >> >> >> > What is done:
> >> >> >> >
> >> >> >> > Input:
> >> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
> >> >> >> >
> >> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> >> >> >> > eigenvalues) of tf-idf-matrix, called:
> >> >> >> > svd (concepts x terms) 87 x 20444
> >> >> >> >
> >> >> >> > transpose tf-idf-matrix:
> >> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> >> >> >> >
> >> >> >> > transpose svd:
> >> >> >> > svd-transpose (terms x concepts) 20444 x 87
> >> >> >> >
> >> >> >> > matrix multiply:
> >> >> >> > tf-idf-matrix-transpose x svd-transpose = result
> >> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
> >> >> >> >
> >> >> >> > so... I do understand, that the "svd" here is not SVD from
> >> wikipedia.
> >> >> >> > It only does the Lanczos algorithm and some magic which produces
> >> the
> >> >> >> >> Instead either the left or right (but usually the right)
> >> eigenvectors
> >> >> >> premultiplied by the diagonal or the square root of the
> >> >> >> >> diagonal element.
> >> >> >> > from
> >> >> >>
> >> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
> >> >> >> >
> >> >> >> > so my question: what is the output of the SVD in mahout. And
> what
> >> do I
> >> >> >> > have to calculate to get the "right singular value" from svd?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Stefan
> >> >> >> >
> >> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >> >>
> >> >> >> >> the last step is the matrix multiplication:
> >> >> >> >>  --arg --numRowsA --arg 20444 \
> >> >> >> >>  --arg --numColsA --arg 6076937 \
> >> >> >> >>  --arg --numRowsB --arg 20444 \
> >> >> >> >>  --arg --numColsB --arg 87 \
> >> >> >> >> so the result is a 6,076,937 x 87 matrix
> >> >> >> >>
> >> >> >> >> the input has 6,076,937 documents (each with 20,444 terms). so the result
> of
> >> >> >> >> matrix multiplication has to be the right singular value
> regarding
> >> to
> >> >> >> >> the dimensions.
> >> >> >> >>
> >> >> >> >> so the result is the "concept-document vector matrix" (as I
> think,
> >> >> >> >> this is also called "document vectors" ?)
> >> >> >> >>
> >> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
> >> >> >> >>> Yes.  These are term vectors, not document vectors.
> >> >> >> >>>
> >> >> >> >>> There is an additional step that can be run to produce
> document
> >> >> >> vectors.
> >> >> >> >>>
> >> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert
> <stefan@wienert.cc
> >> >
> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>>> compared to SVD, is the result is the "right singular value"?
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Stefan Wienert
> >> >> >> >>
> >> >> >> >> http://www.wienert.cc
> >> >> >> >> stefan@wienert.cc
> >> >> >> >>
> >> >> >> >> Telefon: +495251-2026838
> >> >> >> >> Mobil: +49176-40170270
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Stefan Wienert
> >> >> >> >
> >> >> >> > http://www.wienert.cc
> >> >> >> > stefan@wienert.cc
> >> >> >> >
> >> >> >> > Telefon: +495251-2026838
> >> >> >> > Mobil: +49176-40170270
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Stefan Wienert
> >> >> >>
> >> >> >> http://www.wienert.cc
> >> >> >> stefan@wienert.cc
> >> >> >>
> >> >> >> Telefon: +495251-2026838
> >> >> >> Mobil: +49176-40170270
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Stefan Wienert
> >> >>
> >> >> http://www.wienert.cc
> >> >> stefan@wienert.cc
> >> >>
> >> >> Telefon: +495251-2026838
> >> >> Mobil: +49176-40170270
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838
> >> Mobil: +49176-40170270
> >>
> >
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Yeah :) That's what I assumed. So this problem is solved. Thanks!

And now your questions:

I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
Document-Term-Matrix and reduce it using Lanczos. So I get the
Document-Concept-Vectors. These Vectors will be compared with cosine
similarity to find similar documents. I can also cluster these
documents with k-means...

If you have any suggestions, feel free to tell me. I'm also interested
in other document-similarity-techniques.

Cheers,
Stefan
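For anyone following along, here is a minimal in-memory sketch of that pipeline in plain NumPy. This is illustrative only, not Mahout: the toy matrix and rank are made up, and the full SVD here stands in for what the distributed Lanczos job approximates.

```python
import numpy as np

# Toy term-document data standing in for the huge TF-IDF matrix.
# Rows = documents, columns = terms (hypothetical values).
M = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.8, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 1.0],
    [0.0, 0.0, 1.0, 0.8],
])

k = 2  # target rank ("concepts")

# Full SVD, then truncate to rank k (Mahout's Lanczos solver
# approximates the top-k part of this decomposition directly).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
doc_concepts = U[:, :k] * s[:k]  # document-concept vectors (U * Sigma)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Docs 0 and 1 share terms, docs 0 and 2 do not.
print(cosine(doc_concepts[0], doc_concepts[1]))  # high (shared terms)
print(cosine(doc_concepts[0], doc_concepts[2]))  # near zero (disjoint terms)
```

The same `doc_concepts` rows are what you would feed into k-means for clustering.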

2011/6/6 Danny Bickson <da...@gmail.com>:
> If I understand your question correctly, you need to simply transpose M
> before you start the run and that way
> you will get the other singular vectors.
>
> May I ask what is the problem you are working on and why do you need the
> singular vectors?
> Can you consider using another matrix decomposition technique for example
> alternating least squares
> which gives you two lower-rank matrices that approximate the large decomposed
> matrix?
>
> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>
>> Hi Danny!
>>
>> I understand that for M*M' (and for M'*M) the left and right
>> eigenvectors are identical. But that is not exactly what I want. The
>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>> are the left singular vectors of M. But I need the right singular
>> vectors of M (and not M*M'). How do I get them?
>>
>> Sorry, my matrix math is not as good as it should be, but I hope you
>> can help me!
>>
>> Thanks,
>> Stefan
>>
>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>> > Hi Stefan!
>> > For a positive semidefinite matrix, the left and right eigenvectors are
>> > identical.
>> > See SVD wikipedia text: When *M* is also positive
>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>> > the decomposition *M* = *U**D**U* * is also a singular value
>> decomposition.
>> > So you don't need to be worried about the other singular vectors.
>> >
>> > Hope this helps!
>> >
>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
>> wrote:
>> >
>> >> Hi.
>> >>
>> >> Thanks for the help.
>> >>
>> >> The important points from wikipedia are:
>> >> - The left singular vectors of M are eigenvectors of M*M' .
>> >> - The right singular vectors of M are eigenvectors of M'*M.
>> >>
>> >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
>> >> it does A=M*M', but it is not a problem). Therefore it does already
>> >> calculate the right (or left) singular vector of M.
>> >>
>> >> But my question is, how can I get the other singular vector? I can
>> >> transpose M, but then I have to calculate two SVDs, one for the right
>> >> and one for the left singular value... I think there is a better way
>> >> :)
>> >>
>> >> Hope you can help me with this...
>> >> Thanks
>> >> Stefan
>> >>
>> >>
>> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
>> >> > Hi
>> >> > Mahout SVD implementation computes the Lanczos iteration:
>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>> >> > computed by A=M'*M
>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>> >> > computed such that A =~ V*T*V'
>> >> > (this process is done in a distributed way).
>> >> >
>> >> > Next, the matrix T is decomposed into eigenvectors and
>> eigenvalues.
>> >> > Which is the returned result. (This process
>> >> > is serial).
>> >> >
>> >> > The third step makes the returned eigenvectors orthogonal to each
>> other
>> >> > (which is optional IMHO).
>> >> >
>> >> > The heart of the code is found at:
>> >> >
>> >>
>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>> >> > At least that is where it was in version 0.4 I am not sure if there
>> are
>> >> > changes in version 0.5
>> >> >
>> >> > Anyway, Mahout does not compute directly SVD. If you are interested in
>> >> > learning more about the relation to SVD
>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>> >> > subsection: relation to eigenvalue decomposition.
>> >> >
>> >> > Hope this helps,
>> >> >
>> >> > Danny Bickson
>> >> >
>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>> >> wrote:
>> >> >
>> >> >> After reading this thread:
>> >> >>
>> >> >>
>> >>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>> >> >>
>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>> >> >>
>> >> >> The output of Mahout-SVD is (U S) right?
>> >> >>
>> >> >> So... How do I get V from (U S)  and M?
>> >> >>
>> >> >> Is V = M (U S)* (because this is what the calculation in the example
>> >> is)?
>> >> >>
>> >> >> Thanks
>> >> >> Stefan
>> >> >>
>> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >> >> >
>> >> >> > What is done:
>> >> >> >
>> >> >> > Input:
>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>> >> >> >
>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>> >> >> > eigenvalues) of tf-idf-matrix, called:
>> >> >> > svd (concepts x terms) 87 x 20444
>> >> >> >
>> >> >> > transpose tf-idf-matrix:
>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>> >> >> >
>> >> >> > transpose svd:
>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>> >> >> >
>> >> >> > matrix multiply:
>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>> >> >> >
>> >> >> > so... I do understand, that the "svd" here is not SVD from
>> wikipedia.
>> >> >> > It only does the Lanczos algorithm and some magic which produces
>> the
>> >> >> >> Instead either the left or right (but usually the right)
>> eigenvectors
>> >> >> premultiplied by the diagonal or the square root of the
>> >> >> >> diagonal element.
>> >> >> > from
>> >> >>
>> >>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>> >> >> >
>> >> >> > so my question: what is the output of the SVD in mahout. And what
>> do I
>> >> >> > have to calculate to get the "right singular value" from svd?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Stefan
>> >> >> >
>> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >> >> >>
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >> >> >>
>> >> >> >> the last step is the matrix multiplication:
>> >> >> >>  --arg --numRowsA --arg 20444 \
>> >> >> >>  --arg --numColsA --arg 6076937 \
>> >> >> >>  --arg --numRowsB --arg 20444 \
>> >> >> >>  --arg --numColsB --arg 87 \
>> >> >> >> so the result is a 6,076,937 x 87 matrix
>> >> >> >>
>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms). so the result of
>> >> >> >> matrix multiplication has to be the right singular value regarding
>> to
>> >> >> >> the dimensions.
>> >> >> >>
>> >> >> >> so the result is the "concept-document vector matrix" (as I think,
>> >> >> >> this is also called "document vectors" ?)
>> >> >> >>
>> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>> >> >> >>> Yes.  These are term vectors, not document vectors.
>> >> >> >>>
>> >> >> >>> There is an additional step that can be run to produce document
>> >> >> vectors.
>> >> >> >>>
>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
>> >
>> >> >> wrote:
>> >> >> >>>
>> >> >> >>>> compared to SVD, is the result is the "right singular value"?
>> >> >> >>>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Stefan Wienert
>> >> >> >>
>> >> >> >> http://www.wienert.cc
>> >> >> >> stefan@wienert.cc
>> >> >> >>
>> >> >> >> Telefon: +495251-2026838
>> >> >> >> Mobil: +49176-40170270
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Stefan Wienert
>> >> >> >
>> >> >> > http://www.wienert.cc
>> >> >> > stefan@wienert.cc
>> >> >> >
>> >> >> > Telefon: +495251-2026838
>> >> >> > Mobil: +49176-40170270
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Stefan Wienert
>> >> >>
>> >> >> http://www.wienert.cc
>> >> >> stefan@wienert.cc
>> >> >>
>> >> >> Telefon: +495251-2026838
>> >> >> Mobil: +49176-40170270
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Stefan Wienert
>> >>
>> >> http://www.wienert.cc
>> >> stefan@wienert.cc
>> >>
>> >> Telefon: +495251-2026838
>> >> Mobil: +49176-40170270
>> >>
>> >
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Ted Dunning <te...@gmail.com>.
The current arrangement makes it easier to compute the reduced
representations of new documents.  It requires an extra operation, either a
transpose or a multiply, to get the vectors for the documents (rows).
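A small NumPy sketch of what that extra multiply does (toy matrix; this assumes the solver output corresponds to the right singular vectors V in term space):

```python
import numpy as np

# Toy TF-IDF matrix: rows = docs, cols = terms (hypothetical values).
M = np.array([
    [1.0, 0.4, 0.0],
    [0.9, 0.5, 0.1],
    [0.0, 0.2, 1.0],
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T  # right singular vectors: terms x concepts

# The job's extra multiply: document-concept vectors.
doc_vecs = M @ V  # equals U * Sigma (docs x concepts)
assert np.allclose(doc_vecs, U * s)

# The same multiply "folds in" a brand-new document
# without rerunning the decomposition:
new_doc = np.array([0.8, 0.3, 0.0])  # TF-IDF vector of an unseen doc
new_doc_concepts = new_doc @ V
```

This is why keeping the term-space vectors around is convenient: any new document vector can be projected with one multiplication.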

On Mon, Jun 6, 2011 at 10:36 AM, Danny Bickson <da...@gmail.com>wrote:

> If I understand your question correctly, you need to simply transpose M
> before you start the run and that way
> you will get the other singular vectors.
>
> May I ask what is the problem you are working on and why do you need the
> singular vectors?
> Can you consider using another matrix decomposition technique for example
> alternating least squares
> > which gives you two lower-rank matrices that approximate the large
> decomposed
> matrix?
>
> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:
>
> > Hi Danny!
> >
> > I understand that for M*M' (and for M'*M) the left and right
> > eigenvectors are identical. But that is not exactly what I want. The
> > lanczos solver from mahout gives me the eigenvectors of M*M', which
> > are the left singular vectors of M. But I need the right singular
> > vectors of M (and not M*M'). How do I get them?
> >
> > Sorry, my matrix math is not as good as it should be, but I hope you
> > can help me!
> >
> > Thanks,
> > Stefan
> >
> > 2011/6/6 Danny Bickson <da...@gmail.com>:
> > > Hi Stefan!
> > > For a positive semidefinite matrix, the left and right eigenvectors are
> > > identical.
> > > See SVD wikipedia text: When *M* is also positive
> > > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
> > > the decomposition *M* = *U**D**U* * is also a singular value
> > decomposition.
> > > So you don't need to be worried about the other singular vectors.
> > >
> > > Hope this helps!
> > >
> > > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
> > wrote:
> > >
> > >> Hi.
> > >>
> > >> Thanks for the help.
> > >>
> > >> The important points from wikipedia are:
> > >> - The left singular vectors of M are eigenvectors of M*M' .
> > >> - The right singular vectors of M are eigenvectors of M'*M.
> > >>
> > >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
> > >> it does A=M*M', but it is not a problem). Therefore it does already
> > >> calculate the right (or left) singular vector of M.
> > >>
> > >> But my question is, how can I get the other singular vector? I can
> > >> transpose M, but then I have to calculate two SVDs, one for the right
> > >> and one for the left singular value... I think there is a better way
> > >> :)
> > >>
> > >> Hope you can help me with this...
> > >> Thanks
> > >> Stefan
> > >>
> > >>
> > >> 2011/6/6 Danny Bickson <da...@gmail.com>:
> > >> > Hi
> > >> > Mahout SVD implementation computes the Lanczos iteration:
> > >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
> > >> > Denote the non-square input matrix as M. First a symmetric matrix A
> is
> > >> > computed by A=M'*M
> > >> > Then an approximating tridiagonal matrix T and a vector matrix V are
> > >> > computed such that A =~ V*T*V'
> > >> > (this process is done in a distributed way).
> > >> >
> > >> > Next, the matrix T is decomposed into eigenvectors and
> > eigenvalues.
> > >> > Which is the returned result. (This process
> > >> > is serial).
> > >> >
> > >> > The third step makes the returned eigenvectors orthogonal to each
> > other
> > >> > (which is optional IMHO).
> > >> >
> > >> > The heart of the code is found at:
> > >> >
> > >>
> >
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> > >> > At least that is where it was in version 0.4 I am not sure if there
> > are
> > >> > changes in version 0.5
> > >> >
> > >> > Anyway, Mahout does not compute directly SVD. If you are interested
> in
> > >> > learning more about the relation to SVD
> > >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
> > >> > subsection: relation to eigenvalue decomposition.
> > >> >
> > >> > Hope this helps,
> > >> >
> > >> > Danny Bickson
> > >> >
> > >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
> > >> wrote:
> > >> >
> > >> >> After reading this thread:
> > >> >>
> > >> >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
> > >> >>
> > >> >> Wiki-SVD: M = U S V* (* = transposed)
> > >> >>
> > >> >> The output of Mahout-SVD is (U S) right?
> > >> >>
> > >> >> So... How do I get V from (U S)  and M?
> > >> >>
> > >> >> Is V = M (U S)* (because this is what the calculation in the
> example
> > >> is)?
> > >> >>
> > >> >> Thanks
> > >> >> Stefan
> > >> >>
> > >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> > >> >> >
> > >>
> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> > >> >> >
> > >> >> > What is done:
> > >> >> >
> > >> >> > Input:
> > >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
> > >> >> >
> > >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> > >> >> > eigenvalues) of tf-idf-matrix, called:
> > >> >> > svd (concepts x terms) 87 x 20444
> > >> >> >
> > >> >> > transpose tf-idf-matrix:
> > >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> > >> >> >
> > >> >> > transpose svd:
> > >> >> > svd-transpose (terms x concepts) 20444 x 87
> > >> >> >
> > >> >> > matrix multiply:
> > >> >> > tf-idf-matrix-transpose x svd-transpose = result
> > >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
> > >> >> >
> > >> >> > so... I do understand, that the "svd" here is not SVD from
> > wikipedia.
> > >> >> > It only does the Lanczos algorithm and some magic which produces
> > the
> > >> >> >> Instead either the left or right (but usually the right)
> > eigenvectors
> > >> >> premultiplied by the diagonal or the square root of the
> > >> >> >> diagonal element.
> > >> >> > from
> > >> >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
> > >> >> >
> > >> >> > so my question: what is the output of the SVD in mahout. And what
> > do I
> > >> >> > have to calculate to get the "right singular value" from svd?
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Stefan
> > >> >> >
> > >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> > >> >> >>
> > >> >>
> > >>
> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> > >> >> >>
> > >> >> >> the last step is the matrix multiplication:
> > >> >> >>  --arg --numRowsA --arg 20444 \
> > >> >> >>  --arg --numColsA --arg 6076937 \
> > >> >> >>  --arg --numRowsB --arg 20444 \
> > >> >> >>  --arg --numColsB --arg 87 \
> > >> >> >> so the result is a 6,076,937 x 87 matrix
> > >> >> >>
> > >> >> >> the input has 6,076,937 documents (each with 20,444 terms). so the result
> of
> > >> >> >> matrix multiplication has to be the right singular value
> regarding
> > to
> > >> >> >> the dimensions.
> > >> >> >>
> > >> >> >> so the result is the "concept-document vector matrix" (as I
> think,
> > >> >> >> this is also called "document vectors" ?)
> > >> >> >>
> > >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
> > >> >> >>> Yes.  These are term vectors, not document vectors.
> > >> >> >>>
> > >> >> >>> There is an additional step that can be run to produce document
> > >> >> vectors.
> > >> >> >>>
> > >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert
> <stefan@wienert.cc
> > >
> > >> >> wrote:
> > >> >> >>>
> > >> >> >>>> compared to SVD, is the result is the "right singular value"?
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> --
> > >> >> >> Stefan Wienert
> > >> >> >>
> > >> >> >> http://www.wienert.cc
> > >> >> >> stefan@wienert.cc
> > >> >> >>
> > >> >> >> Telefon: +495251-2026838
> > >> >> >> Mobil: +49176-40170270
> > >> >> >>
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > --
> > >> >> > Stefan Wienert
> > >> >> >
> > >> >> > http://www.wienert.cc
> > >> >> > stefan@wienert.cc
> > >> >> >
> > >> >> > Telefon: +495251-2026838
> > >> >> > Mobil: +49176-40170270
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Stefan Wienert
> > >> >>
> > >> >> http://www.wienert.cc
> > >> >> stefan@wienert.cc
> > >> >>
> > >> >> Telefon: +495251-2026838
> > >> >> Mobil: +49176-40170270
> > >> >>
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Stefan Wienert
> > >>
> > >> http://www.wienert.cc
> > >> stefan@wienert.cc
> > >>
> > >> Telefon: +495251-2026838
> > >> Mobil: +49176-40170270
> > >>
> > >
> >
> >
> >
> > --
> > Stefan Wienert
> >
> > http://www.wienert.cc
> > stefan@wienert.cc
> >
> > Telefon: +495251-2026838
> > Mobil: +49176-40170270
> >
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Danny Bickson <da...@gmail.com>.
If I understand your question correctly, you simply need to transpose M
before you start the run, and that way
you will get the other singular vectors.

May I ask what problem you are working on, and why do you need the
singular vectors?
Could you consider another matrix decomposition technique, for example
alternating least squares,
which gives you two lower-rank matrices that approximate the large decomposed
matrix?
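The transpose trick can be checked on a toy matrix with NumPy (illustrative only; the signs of singular vectors are arbitrary in any SVD, so the comparison is up to sign):

```python
import numpy as np

# Hypothetical small rectangular matrix (docs x terms).
M = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U2, s2, Vt2 = np.linalg.svd(M.T, full_matrices=False)

# Singular values are unchanged by transposition...
assert np.allclose(s, s2)

# ...and the left/right singular vectors swap roles:
# the left singular vectors of M reappear as the right
# singular vectors of M' (up to a per-vector sign).
for i in range(len(s)):
    assert np.allclose(np.abs(U[:, i]), np.abs(Vt2[i, :]))
```

So running the same solver on the transposed input yields the other set of singular vectors, at the cost of a second (transposed) job.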

On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <st...@wienert.cc> wrote:

> Hi Danny!
>
> I understand that for M*M' (and for M'*M) the left and right
> eigenvectors are identical. But that is not exactly what I want. The
> lanczos solver from mahout gives me the eigenvectors of M*M', which
> are the left singular vectors of M. But I need the right singular
> vectors of M (and not M*M'). How do I get them?
>
> Sorry, my matrix math is not as good as it should be, but I hope you
> can help me!
>
> Thanks,
> Stefan
>
> 2011/6/6 Danny Bickson <da...@gmail.com>:
> > Hi Stefan!
> > For a positive semidefinite matrix, the left and right eigenvectors are
> > identical.
> > See SVD wikipedia text: When *M* is also positive
> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
> > the decomposition *M* = *U**D**U* * is also a singular value
> decomposition.
> > So you don't need to be worried about the other singular vectors.
> >
> > Hope this helps!
> >
> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc>
> wrote:
> >
> >> Hi.
> >>
> >> Thanks for the help.
> >>
> >> The important points from wikipedia are:
> >> - The left singular vectors of M are eigenvectors of M*M' .
> >> - The right singular vectors of M are eigenvectors of M'*M.
> >>
> >> as you describe, the mahout lanczos solver calculates A=M'*M (I think
> >> it does A=M*M', but it is not a problem). Therefore it does already
> >> calculate the right (or left) singular vector of M.
> >>
> >> But my question is, how can I get the other singular vector? I can
> >> transpose M, but then I have to calculate two SVDs, one for the right
> >> and one for the left singular value... I think there is a better way
> >> :)
> >>
> >> Hope you can help me with this...
> >> Thanks
> >> Stefan
> >>
> >>
> >> 2011/6/6 Danny Bickson <da...@gmail.com>:
> >> > Hi
> >> > Mahout SVD implementation computes the Lanczos iteration:
> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
> >> > computed by A=M'*M
> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
> >> > computed such that A =~ V*T*V'
> >> > (this process is done in a distributed way).
> >> >
> >> > Next, the matrix T is decomposed into eigenvectors and
> eigenvalues.
> >> > Which is the returned result. (This process
> >> > is serial).
> >> >
> >> > The third step makes the returned eigenvectors orthogonal to each
> other
> >> > (which is optional IMHO).
> >> >
> >> > The heart of the code is found at:
> >> >
> >>
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> >> > At least that is where it was in version 0.4 I am not sure if there
> are
> >> > changes in version 0.5
> >> >
> >> > Anyway, Mahout does not compute directly SVD. If you are interested in
> >> > learning more about the relation to SVD
> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
> >> > subsection: relation to eigenvalue decomposition.
> >> >
> >> > Hope this helps,
> >> >
> >> > Danny Bickson
> >> >
> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
> >> wrote:
> >> >
> >> >> After reading this thread:
> >> >>
> >> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
> >> >>
> >> >> Wiki-SVD: M = U S V* (* = transposed)
> >> >>
> >> >> The output of Mahout-SVD is (U S) right?
> >> >>
> >> >> So... How do I get V from (U S)  and M?
> >> >>
> >> >> Is V = M (U S)* (because this is, what the calculation in the example
> >> is)?
> >> >>
> >> >> Thanks
> >> >> Stefan
> >> >>
> >> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >> >
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >
> >> >> > What is done:
> >> >> >
> >> >> > Input:
> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
> >> >> >
> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> >> >> > eigenvalues) of tf-idf-matrix, called:
> >> >> > svd (concepts x terms) 87 x 20444
> >> >> >
> >> >> > transpose tf-idf-matrix:
> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> >> >> >
> >> >> > transpose svd:
> >> >> > svd-transpose (terms x concepts) 20444 x 87
> >> >> >
> >> >> > matrix multiply:
> >> >> > tf-idf-matrix-transpose x svd-transpose = result
> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
> >> >> >
> >> >> > so... I do understand, that the "svd" here is not SVD from
> wikipedia.
> >> >> > It only does the Lanczos algorithm and some magic which produces
> the
> >> >> >> Instead either the left or right (but usually the right)
> eigenvectors
> >> >> premultiplied by the diagonal or the square root of the
> >> >> >> diagonal element.
> >> >> > from
> >> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
> >> >> >
> >> >> > so my question: what is the output of the SVD in mahout. And what
> do I
> >> >> > have to calculate to get the "right singular value" from svd?
> >> >> >
> >> >> > Thanks,
> >> >> > Stefan
> >> >> >
> >> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >>
> >> >> >> the last step is the matrix multiplication:
> >> >> >>  --arg --numRowsA --arg 20444 \
> >> >> >>  --arg --numColsA --arg 6076937 \
> >> >> >>  --arg --numRowsB --arg 20444 \
> >> >> >>  --arg --numColsB --arg 87 \
> >> >> >> so the result is a 6,076,937 x 87 matrix
> >> >> >>
> >> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
> >> >> >> matrix multiplication has to be the right singular value regarding
> to
> >> >> >> the dimensions.
> >> >> >>
> >> >> >> so the result is the "concept-document vector matrix" (as I think,
> >> >> >> these is also called "document vectors" ?)
> >> >> >>
> >> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
> >> >> >>> Yes.  These are term vectors, not document vectors.
> >> >> >>>
> >> >> >>> There is an additional step that can be run to produce document
> >> >> vectors.
> >> >> >>>
> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc
> >
> >> >> wrote:
> >> >> >>>
> >> >> >>>> compared to SVD, is the result is the "right singular value"?
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Stefan Wienert
> >> >> >>
> >> >> >> http://www.wienert.cc
> >> >> >> stefan@wienert.cc
> >> >> >>
> >> >> >> Telefon: +495251-2026838
> >> >> >> Mobil: +49176-40170270
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Stefan Wienert
> >> >> >
> >> >> > http://www.wienert.cc
> >> >> > stefan@wienert.cc
> >> >> >
> >> >> > Telefon: +495251-2026838
> >> >> > Mobil: +49176-40170270
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Stefan Wienert
> >> >>
> >> >> http://www.wienert.cc
> >> >> stefan@wienert.cc
> >> >>
> >> >> Telefon: +495251-2026838
> >> >> Mobil: +49176-40170270
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838
> >> Mobil: +49176-40170270
> >>
> >
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Hi Danny!

I understand that for M*M' (and for M'*M) the left and right
eigenvectors are identical. But that is not exactly what I want. The
lanczos solver from mahout gives me the eigenvectors of M*M', which
are the left singular vectors of M. But I need the right singular
vectors of M (and not M*M'). How do I get them?

Sorry, my matrix math is not as good as it should be, but I hope you
can help me!

Thanks,
Stefan

2011/6/6 Danny Bickson <da...@gmail.com>:
> Hi Stefan!
> For a positive semidefinite matrix, the lest and right eigenvectors are
> identical.
> See SVD wikipeida text: When *M* is also positive
> semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
> the decomposition *M* = *U**D**U* * is also a singular value decomposition.
> So you don't need to be worried about the other singular vectors.
>
> Hope this helps!
>
> On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc> wrote:
>
>> Hi.
>>
>> Thanks for the help.
>>
>> The important points from wikipedia are:
>> - The left singular vectors of M are eigenvectors of M*M' .
>> - The right singular vectors of M are eigenvectors of M'*M.
>>
>> as you describe, the mahout lanczos solver calculate A=M'*M (I think
>> it does A=M*M', but it is not a problem). Therefore it does already
>> calculate the right (or left) singular vector of M.
>>
>> But my question is, how can I get the other singular vector? I can
>> transpose M, but then I have to calculated two SVDs, one for the right
>> and one for the left singular value... I think there is a better way
>> :)
>>
>> Hope you can help me with this...
>> Thanks
>> Stefan
>>
>>
>> 2011/6/6 Danny Bickson <da...@gmail.com>:
>> > Hi
>> > Mahout SVD implementation computes the Lanzcos iteration:
>> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>> > Denote the non-square input matrix as M. First a symmetric matrix A is
>> > computed by A=M'*M
>> > Then an approximating tridiagonal matrix T and a vector matrix V are
>> > computed such that A =~ V*T*V'
>> > (this process is done in a distributed way).
>> >
>> > Next the matrix T is next decomposed into eigenvectors and eignevalues.
>> > Which is the returned result. (This process
>> > is serial).
>> >
>> > The third step makes the returned eigenvectors orthogonal to each other
>> > (which is optional IMHO).
>> >
>> > The heart of the code is found at:
>> >
>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>> > At least that is where it was in version 0.4 I am not sure if there are
>> > changes in version 0.5
>> >
>> > Anyway, Mahout does not compute directly SVD. If you are interested in
>> > learning more about the relation to SVD
>> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>> > subsection: relation to eigenvalue decomposition.
>> >
>> > Hope this helps,
>> >
>> > Danny Bickson
>> >
>> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
>> wrote:
>> >
>> >> After reading this thread:
>> >>
>> >>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>> >>
>> >> Wiki-SVD: M = U S V* (* = transposed)
>> >>
>> >> The output of Mahout-SVD is (U S) right?
>> >>
>> >> So... How do I get V from (U S)  and M?
>> >>
>> >> Is V = M (U S)* (because this is, what the calculation in the example
>> is)?
>> >>
>> >> Thanks
>> >> Stefan
>> >>
>> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >> >
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >> >
>> >> > What is done:
>> >> >
>> >> > Input:
>> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>> >> >
>> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>> >> > eigenvalues) of tf-idf-matrix, called:
>> >> > svd (concepts x terms) 87 x 20444
>> >> >
>> >> > transpose tf-idf-matrix:
>> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>> >> >
>> >> > transpose svd:
>> >> > svd-transpose (terms x concepts) 20444 x 87
>> >> >
>> >> > matrix multiply:
>> >> > tf-idf-matrix-transpose x svd-transpose = result
>> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>> >> >
>> >> > so... I do understand, that the "svd" here is not SVD from wikipedia.
>> >> > It only does the Lanczos algorithm and some magic which produces the
>> >> >> Instead either the left or right (but usually the right) eigenvectors
>> >> premultiplied by the diagonal or the square root of the
>> >> >> diagonal element.
>> >> > from
>> >>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>> >> >
>> >> > so my question: what is the output of the SVD in mahout. And what do I
>> >> > have to calculate to get the "right singular value" from svd?
>> >> >
>> >> > Thanks,
>> >> > Stefan
>> >> >
>> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >> >>
>> >> >> the last step is the matrix multiplication:
>> >> >>  --arg --numRowsA --arg 20444 \
>> >> >>  --arg --numColsA --arg 6076937 \
>> >> >>  --arg --numRowsB --arg 20444 \
>> >> >>  --arg --numColsB --arg 87 \
>> >> >> so the result is a 6,076,937 x 87 matrix
>> >> >>
>> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>> >> >> matrix multiplication has to be the right singular value regarding to
>> >> >> the dimensions.
>> >> >>
>> >> >> so the result is the "concept-document vector matrix" (as I think,
>> >> >> these is also called "document vectors" ?)
>> >> >>
>> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>> >> >>> Yes.  These are term vectors, not document vectors.
>> >> >>>
>> >> >>> There is an additional step that can be run to produce document
>> >> vectors.
>> >> >>>
>> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc>
>> >> wrote:
>> >> >>>
>> >> >>>> compared to SVD, is the result is the "right singular value"?
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Stefan Wienert
>> >> >>
>> >> >> http://www.wienert.cc
>> >> >> stefan@wienert.cc
>> >> >>
>> >> >> Telefon: +495251-2026838
>> >> >> Mobil: +49176-40170270
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Stefan Wienert
>> >> >
>> >> > http://www.wienert.cc
>> >> > stefan@wienert.cc
>> >> >
>> >> > Telefon: +495251-2026838
>> >> > Mobil: +49176-40170270
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Stefan Wienert
>> >>
>> >> http://www.wienert.cc
>> >> stefan@wienert.cc
>> >>
>> >> Telefon: +495251-2026838
>> >> Mobil: +49176-40170270
>> >>
>> >
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Danny Bickson <da...@gmail.com>.
Hi Stefan!
For a positive semidefinite matrix, the left and right singular vectors are
identical.
See the SVD Wikipedia text: when M is also positive
semi-definite <http://en.wikipedia.org/wiki/Positive-definite_matrix>,
the decomposition M = U D U* is also a singular value decomposition.
So you don't need to worry about the other singular vectors.

Hope this helps!

On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <st...@wienert.cc> wrote:

> Hi.
>
> Thanks for the help.
>
> The important points from wikipedia are:
> - The left singular vectors of M are eigenvectors of M*M' .
> - The right singular vectors of M are eigenvectors of M'*M.
>
> as you describe, the mahout lanczos solver calculate A=M'*M (I think
> it does A=M*M', but it is not a problem). Therefore it does already
> calculate the right (or left) singular vector of M.
>
> But my question is, how can I get the other singular vector? I can
> transpose M, but then I have to calculated two SVDs, one for the right
> and one for the left singular value... I think there is a better way
> :)
>
> Hope you can help me with this...
> Thanks
> Stefan
>
>
> 2011/6/6 Danny Bickson <da...@gmail.com>:
> > Hi
> > Mahout SVD implementation computes the Lanzcos iteration:
> > http://en.wikipedia.org/wiki/Lanczos_algorithm
> > Denote the non-square input matrix as M. First a symmetric matrix A is
> > computed by A=M'*M
> > Then an approximating tridiagonal matrix T and a vector matrix V are
> > computed such that A =~ V*T*V'
> > (this process is done in a distributed way).
> >
> > Next the matrix T is next decomposed into eigenvectors and eignevalues.
> > Which is the returned result. (This process
> > is serial).
> >
> > The third step makes the returned eigenvectors orthogonal to each other
> > (which is optional IMHO).
> >
> > The heart of the code is found at:
> >
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> > At least that is where it was in version 0.4 I am not sure if there are
> > changes in version 0.5
> >
> > Anyway, Mahout does not compute directly SVD. If you are interested in
> > learning more about the relation to SVD
> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
> > subsection: relation to eigenvalue decomposition.
> >
> > Hope this helps,
> >
> > Danny Bickson
> >
> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc>
> wrote:
> >
> >> After reading this thread:
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
> >>
> >> Wiki-SVD: M = U S V* (* = transposed)
> >>
> >> The output of Mahout-SVD is (U S) right?
> >>
> >> So... How do I get V from (U S)  and M?
> >>
> >> Is V = M (U S)* (because this is, what the calculation in the example
> is)?
> >>
> >> Thanks
> >> Stefan
> >>
> >> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >
> >> > What is done:
> >> >
> >> > Input:
> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
> >> >
> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> >> > eigenvalues) of tf-idf-matrix, called:
> >> > svd (concepts x terms) 87 x 20444
> >> >
> >> > transpose tf-idf-matrix:
> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> >> >
> >> > transpose svd:
> >> > svd-transpose (terms x concepts) 20444 x 87
> >> >
> >> > matrix multiply:
> >> > tf-idf-matrix-transpose x svd-transpose = result
> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
> >> >
> >> > so... I do understand, that the "svd" here is not SVD from wikipedia.
> >> > It only does the Lanczos algorithm and some magic which produces the
> >> >> Instead either the left or right (but usually the right) eigenvectors
> >> premultiplied by the diagonal or the square root of the
> >> >> diagonal element.
> >> > from
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
> >> >
> >> > so my question: what is the output of the SVD in mahout. And what do I
> >> > have to calculate to get the "right singular value" from svd?
> >> >
> >> > Thanks,
> >> > Stefan
> >> >
> >> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >>
> >> >> the last step is the matrix multiplication:
> >> >>  --arg --numRowsA --arg 20444 \
> >> >>  --arg --numColsA --arg 6076937 \
> >> >>  --arg --numRowsB --arg 20444 \
> >> >>  --arg --numColsB --arg 87 \
> >> >> so the result is a 6,076,937 x 87 matrix
> >> >>
> >> >> the input has 6,076,937 (each with 20,444 terms). so the result of
> >> >> matrix multiplication has to be the right singular value regarding to
> >> >> the dimensions.
> >> >>
> >> >> so the result is the "concept-document vector matrix" (as I think,
> >> >> these is also called "document vectors" ?)
> >> >>
> >> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
> >> >>> Yes.  These are term vectors, not document vectors.
> >> >>>
> >> >>> There is an additional step that can be run to produce document
> >> vectors.
> >> >>>
> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc>
> >> wrote:
> >> >>>
> >> >>>> compared to SVD, is the result is the "right singular value"?
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Stefan Wienert
> >> >>
> >> >> http://www.wienert.cc
> >> >> stefan@wienert.cc
> >> >>
> >> >> Telefon: +495251-2026838
> >> >> Mobil: +49176-40170270
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Stefan Wienert
> >> >
> >> > http://www.wienert.cc
> >> > stefan@wienert.cc
> >> >
> >> > Telefon: +495251-2026838
> >> > Mobil: +49176-40170270
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838
> >> Mobil: +49176-40170270
> >>
> >
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
Hi.

Thanks for the help.

The important points from wikipedia are:
- The left singular vectors of M are eigenvectors of M*M' .
- The right singular vectors of M are eigenvectors of M'*M.

As you describe, the Mahout Lanczos solver calculates A = M'*M (I think
it actually does A = M*M', but that is not a problem). Therefore it already
calculates the right (or left) singular vectors of M.

But my question is: how can I get the other singular vectors? I can
transpose M, but then I would have to calculate two SVDs, one for the right
and one for the left singular vectors... I think there is a better way
:)

Hope you can help me with this...
Thanks
Stefan
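A small numpy sketch of the relation in question (toy data, not Mahout code): given the eigenvectors V of M'M, which are the right singular vectors of M, the left singular vectors follow from U = M V S^{-1}, so a second decomposition is not needed.

```python
import numpy as np

# Hedged sketch (toy numpy data, not Mahout code): if a Lanczos-style
# solver returns the eigenvectors V of M'M -- the right singular vectors
# of M -- the left singular vectors follow from U = M V S^{-1}, so a
# second decomposition of the transposed matrix is not needed.
rng = np.random.default_rng(1)
M = rng.standard_normal((8, 5))      # docs x terms

# Right singular vectors via the eigen-decomposition of M'M,
# reordered so the largest singular values come first.
eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]
V = V[:, order]
sigma = np.sqrt(eigvals[order])

U = (M @ V) / sigma                  # left singular vectors, one division per column

# Check: U, sigma, V reconstruct M, and U has orthonormal columns.
assert np.allclose((U * sigma) @ V.T, M)
assert np.allclose(U.T @ U, np.eye(5), atol=1e-8)
```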


2011/6/6 Danny Bickson <da...@gmail.com>:
> Hi
> Mahout SVD implementation computes the Lanzcos iteration:
> http://en.wikipedia.org/wiki/Lanczos_algorithm
> Denote the non-square input matrix as M. First a symmetric matrix A is
> computed by A=M'*M
> Then an approximating tridiagonal matrix T and a vector matrix V are
> computed such that A =~ V*T*V'
> (this process is done in a distributed way).
>
> Next the matrix T is next decomposed into eigenvectors and eignevalues.
> Which is the returned result. (This process
> is serial).
>
> The third step makes the returned eigenvectors orthogonal to each other
> (which is optional IMHO).
>
> The heart of the code is found at:
> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> At least that is where it was in version 0.4 I am not sure if there are
> changes in version 0.5
>
> Anyway, Mahout does not compute directly SVD. If you are interested in
> learning more about the relation to SVD
> look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
> subsection: relation to eigenvalue decomposition.
>
> Hope this helps,
>
> Danny Bickson
>
> On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc> wrote:
>
>> After reading this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>
>> Wiki-SVD: M = U S V* (* = transposed)
>>
>> The output of Mahout-SVD is (U S) right?
>>
>> So... How do I get V from (U S)  and M?
>>
>> Is V = M (U S)* (because this is, what the calculation in the example is)?
>>
>> Thanks
>> Stefan
>>
>> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >
>> > What is done:
>> >
>> > Input:
>> > tf-idf-matrix (docs x terms) 6076937 x 20444
>> >
>> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>> > eigenvalues) of tf-idf-matrix, called:
>> > svd (concepts x terms) 87 x 20444
>> >
>> > transpose tf-idf-matrix:
>> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>> >
>> > transpose svd:
>> > svd-transpose (terms x concepts) 20444 x 87
>> >
>> > matrix multiply:
>> > tf-idf-matrix-transpose x svd-transpose = result
>> > (terms x docs) x (terms x concepts) = (docs x concepts)
>> >
>> > so... I do understand, that the "svd" here is not SVD from wikipedia.
>> > It only does the Lanczos algorithm and some magic which produces the
>> >> Instead either the left or right (but usually the right) eigenvectors
>> premultiplied by the diagonal or the square root of the
>> >> diagonal element.
>> > from
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>> >
>> > so my question: what is the output of the SVD in mahout. And what do I
>> > have to calculate to get the "right singular value" from svd?
>> >
>> > Thanks,
>> > Stefan
>> >
>> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> >>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>> >>
>> >> the last step is the matrix multiplication:
>> >>  --arg --numRowsA --arg 20444 \
>> >>  --arg --numColsA --arg 6076937 \
>> >>  --arg --numRowsB --arg 20444 \
>> >>  --arg --numColsB --arg 87 \
>> >> so the result is a 6,076,937 x 87 matrix
>> >>
>> >> the input has 6,076,937 (each with 20,444 terms). so the result of
>> >> matrix multiplication has to be the right singular value regarding to
>> >> the dimensions.
>> >>
>> >> so the result is the "concept-document vector matrix" (as I think,
>> >> these is also called "document vectors" ?)
>> >>
>> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
>> >>> Yes.  These are term vectors, not document vectors.
>> >>>
>> >>> There is an additional step that can be run to produce document
>> vectors.
>> >>>
>> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc>
>> wrote:
>> >>>
>> >>>> compared to SVD, is the result is the "right singular value"?
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Stefan Wienert
>> >>
>> >> http://www.wienert.cc
>> >> stefan@wienert.cc
>> >>
>> >> Telefon: +495251-2026838
>> >> Mobil: +49176-40170270
>> >>
>> >
>> >
>> >
>> > --
>> > Stefan Wienert
>> >
>> > http://www.wienert.cc
>> > stefan@wienert.cc
>> >
>> > Telefon: +495251-2026838
>> > Mobil: +49176-40170270
>> >
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Danny Bickson <da...@gmail.com>.
Hi
The Mahout SVD implementation computes the Lanczos iteration:
http://en.wikipedia.org/wiki/Lanczos_algorithm
Denote the non-square input matrix as M. First a symmetric matrix A is
computed by A=M'*M
Then an approximating tridiagonal matrix T and a vector matrix V are
computed such that A =~ V*T*V'
(this process is done in a distributed way).

Next, the matrix T is decomposed into eigenvectors and eigenvalues,
which is the returned result. (This process
is serial.)

The third step makes the returned eigenvectors orthogonal to each other
(which is optional IMHO).

The heart of the code is found at:
./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
At least that is where it was in version 0.4; I am not sure if there are
changes in version 0.5.

Anyway, Mahout does not compute the SVD directly. If you are interested in
learning more about the relation to SVD
look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
subsection: relation to eigenvalue decomposition.

Hope this helps,

Danny Bickson
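The relationship described above can be checked with a small numpy sketch (toy data, not Mahout code): the eigenvectors of A = M'M are the right singular vectors of M, and the eigenvalues of A are the squared singular values of M.

```python
import numpy as np

# Illustrative numpy sketch (toy data, not Mahout code): the eigenvectors
# of A = M'M are the right singular vectors of M, and the eigenvalues of A
# are the squared singular values of M.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))              # docs x terms

A = M.T @ M                                  # symmetric terms x terms matrix
eigvals, eigvecs = np.linalg.eigh(A)

# Compare against a direct SVD of M.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# eigh returns ascending eigenvalues; SVD returns descending singular values.
order = np.argsort(eigvals)[::-1]
sigma_from_eig = np.sqrt(eigvals[order])
assert np.allclose(sigma_from_eig, s)

# Each eigenvector matches the corresponding right singular vector up to sign.
for k in range(4):
    assert np.allclose(abs(eigvecs[:, order[k]] @ Vt[k]), 1.0)
```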

On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <st...@wienert.cc> wrote:

> After reading this thread:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>
> Wiki-SVD: M = U S V* (* = transposed)
>
> The output of Mahout-SVD is (U S) right?
>
> So... How do I get V from (U S)  and M?
>
> Is V = M (U S)* (because this is, what the calculation in the example is)?
>
> Thanks
> Stefan
>
> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >
> > What is done:
> >
> > Input:
> > tf-idf-matrix (docs x terms) 6076937 x 20444
> >
> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> > eigenvalues) of tf-idf-matrix, called:
> > svd (concepts x terms) 87 x 20444
> >
> > transpose tf-idf-matrix:
> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> >
> > transpose svd:
> > svd-transpose (terms x concepts) 20444 x 87
> >
> > matrix multiply:
> > tf-idf-matrix-transpose x svd-transpose = result
> > (terms x docs) x (terms x concepts) = (docs x concepts)
> >
> > so... I do understand, that the "svd" here is not SVD from wikipedia.
> > It only does the Lanczos algorithm and some magic which produces the
> >> Instead either the left or right (but usually the right) eigenvectors
> premultiplied by the diagonal or the square root of the
> >> diagonal element.
> > from
> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
> >
> > so my question: what is the output of the SVD in mahout. And what do I
> > have to calculate to get the "right singular value" from svd?
> >
> > Thanks,
> > Stefan
> >
> > 2011/6/6 Stefan Wienert <st...@wienert.cc>:
> >>
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >>
> >> the last step is the matrix multiplication:
> >>  --arg --numRowsA --arg 20444 \
> >>  --arg --numColsA --arg 6076937 \
> >>  --arg --numRowsB --arg 20444 \
> >>  --arg --numColsB --arg 87 \
> >> so the result is a 6,076,937 x 87 matrix
> >>
> >> the input has 6,076,937 (each with 20,444 terms). so the result of
> >> matrix multiplication has to be the right singular value regarding to
> >> the dimensions.
> >>
> >> so the result is the "concept-document vector matrix" (as I think,
> >> these is also called "document vectors" ?)
> >>
> >> 2011/6/6 Ted Dunning <te...@gmail.com>:
> >>> Yes.  These are term vectors, not document vectors.
> >>>
> >>> There is an additional step that can be run to produce document
> vectors.
> >>>
> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc>
> wrote:
> >>>
> >>>> compared to SVD, is the result is the "right singular value"?
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838
> >> Mobil: +49176-40170270
> >>
> >
> >
> >
> > --
> > Stefan Wienert
> >
> > http://www.wienert.cc
> > stefan@wienert.cc
> >
> > Telefon: +495251-2026838
> > Mobil: +49176-40170270
> >
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
After reading this thread:
http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E

Wiki-SVD: M = U S V* (* = transposed)

The output of Mahout-SVD is (U S) right?

So... how do I get V from (U S) and M?

Is V = M (U S)* (because that is what the calculation in the example does)?

Thanks
Stefan
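A small numpy check of the hypothesis above (toy data; that the solver output is (U S) is the question's premise, not confirmed here): if M = U S V', then M'(U S) = V S^2, so V can be recovered as M'(U S) S^{-2}.

```python
import numpy as np

# Hedged sketch (toy data): if the solver's output really were (U S),
# then from M = U S V' one gets M'(U S) = V S^2, so V = M'(U S) S^{-2}.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

US = U * s                        # hypothesised "(U S)" solver output
V_recovered = (M.T @ US) / s**2   # V = M'(U S) S^{-2}

assert np.allclose(V_recovered, Vt.T)
```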

2011/6/6 Stefan Wienert <st...@wienert.cc>:
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>
> What is done:
>
> Input:
> tf-idf-matrix (docs x terms) 6076937 x 20444
>
> "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
> eigenvalues) of tf-idf-matrix, called:
> svd (concepts x terms) 87 x 20444
>
> transpose tf-idf-matrix:
> tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>
> transpose svd:
> svd-transpose (terms x concepts) 20444 x 87
>
> matrix multiply:
> tf-idf-matrix-transpose x svd-transpose = result
> (terms x docs) x (terms x concepts) = (docs x concepts)
>
> so... I do understand, that the "svd" here is not SVD from wikipedia.
> It only does the Lanczos algorithm and some magic which produces the
>> Instead either the left or right (but usually the right) eigenvectors premultiplied by the diagonal or the square root of the
>> diagonal element.
> from http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>
> so my question: what is the output of the SVD in mahout. And what do I
> have to calculate to get the "right singular value" from svd?
>
> Thanks,
> Stefan
>
> 2011/6/6 Stefan Wienert <st...@wienert.cc>:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>
>> the last step is the matrix multiplication:
>>  --arg --numRowsA --arg 20444 \
>>  --arg --numColsA --arg 6076937 \
>>  --arg --numRowsB --arg 20444 \
>>  --arg --numColsB --arg 87 \
>> so the result is a 6,076,937 x 87 matrix
>>
>> the input has 6,076,937 (each with 20,444 terms). so the result of
>> matrix multiplication has to be the right singular value regarding to
>> the dimensions.
>>
>> so the result is the "concept-document vector matrix" (as I think,
>> these is also called "document vectors" ?)
>>
>> 2011/6/6 Ted Dunning <te...@gmail.com>:
>>> Yes.  These are term vectors, not document vectors.
>>>
>>> There is an additional step that can be run to produce document vectors.
>>>
>>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>>
>>>> compared to SVD, is the result is the "right singular value"?
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

What is done:

Input:
tf-idf-matrix (docs x terms) 6076937 x 20444

"SVD" of tf-idf-matrix (rank 100) produces the eigenvectors (and
eigenvalues) of tf-idf-matrix, called:
svd (concepts x terms) 87 x 20444

transpose tf-idf-matrix:
tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

transpose svd:
svd-transpose (terms x concepts) 20444 x 87

matrix multiply:
tf-idf-matrix-transpose x svd-transpose = result
(terms x docs) x (terms x concepts) = (docs x concepts)

So... I do understand that the "svd" here is not the SVD from Wikipedia.
It only does the Lanczos algorithm and some magic which produces the
> Instead either the left or right (but usually the right) eigenvectors premultiplied by the diagonal or the square root of the
> diagonal element.
from http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E

so my question is: what exactly does Mahout's SVD job output, and what
do I have to calculate to get the "right singular vectors" from it?
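The relationships in question can be sketched in a few lines of NumPy
(illustrative only, not Mahout code; the matrix sizes are tiny stand-ins
for docs x terms): the right singular vectors of M are the eigenvectors
of M' * M, and once V is known, the final multiplication M * V already
gives U * diag(s), i.e. the documents in concept space.

```python
# Illustrative sketch: right singular vectors as eigenvectors of M'M,
# and document vectors recovered via M * V.
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))                    # tiny stand-in for docs x terms

# Reference SVD: M = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigenvectors of M' * M are the right singular vectors of M (up to sign);
# this is what a Lanczos solver applied to M' * M converges to.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]         # eigh sorts ascending; flip
eigvecs = eigvecs[:, order]
for k in range(4):
    assert np.allclose(np.abs(eigvecs[:, k]), np.abs(Vt[k]), atol=1e-6)

# Given V, the left singular vectors follow from U = M * V * diag(1/s),
# so M * V (the pipeline's final multiply) equals U * diag(s): the
# document coordinates in concept space.
V = Vt.T
U_rec = (M @ V) / s                       # divide column k by s[k]
assert np.allclose(U_rec, U, atol=1e-6)
```

So to get the other set of singular vectors, no second SVD is needed:
multiplying the original matrix by the vectors already computed (and
optionally rescaling by the singular values) is enough.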

Thanks,
Stefan

2011/6/6 Stefan Wienert <st...@wienert.cc>:
> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>
> the last step is the matrix multiplication:
>  --arg --numRowsA --arg 20444 \
>  --arg --numColsA --arg 6076937 \
>  --arg --numRowsB --arg 20444 \
>  --arg --numColsB --arg 87 \
> so the result is a 6,076,937 x 87 matrix
>
> the input has 6,076,937 documents (each with 20,444 terms), so judging
> by the dimensions, the result of the matrix multiplication has to be
> the right singular vectors.
>
> so the result is the "concept-document vector matrix" (I think this is
> also called the "document vectors"?)
>
> 2011/6/6 Ted Dunning <te...@gmail.com>:
>> Yes.  These are term vectors, not document vectors.
>>
>> There is an additional step that can be run to produce document vectors.
>>
>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc> wrote:
>>
>>> compared to SVD, is the result the "right singular value"?
>>>
>>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Stefan Wienert <st...@wienert.cc>.
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

the last step is the matrix multiplication:
  --arg --numRowsA --arg 20444 \
  --arg --numColsA --arg 6076937 \
  --arg --numRowsB --arg 20444 \
  --arg --numColsB --arg 87 \
so the result is a 6,076,937 x 87 matrix

the input has 6,076,937 documents (each with 20,444 terms), so judging
by the dimensions, the result of the matrix multiplication has to be
the right singular vectors.

so the result is the "concept-document vector matrix" (I think this is
also called the "document vectors"?)
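The dimension argument above can be checked with a few lines of NumPy
(illustrative only; tiny stand-in sizes for the 6,076,937 docs, 20,444
terms, and 87 concepts):

```python
# Shape check: with A = tf-idf-matrix-transpose (terms x docs) and
# B = svd-transpose (terms x concepts), a plain A @ B does not conform,
# but A^T @ B does, and yields (docs x concepts).
import numpy as np

docs, terms, concepts = 6, 5, 3
A = np.zeros((terms, docs))               # tf-idf-matrix-transpose
B = np.zeros((terms, concepts))           # svd-transpose

result = A.T @ B                          # (docs x terms) @ (terms x concepts)
assert result.shape == (docs, concepts)
```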

2011/6/6 Ted Dunning <te...@gmail.com>:
> Yes.  These are term vectors, not document vectors.
>
> There is an additional step that can be run to produce document vectors.
>
> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc> wrote:
>
>> compared to SVD, is the result the "right singular value"?
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270

Re: Need a little help with SVD / Dimensional Reduction

Posted by Ted Dunning <te...@gmail.com>.
Yes.  These are term vectors, not document vectors.

There is an additional step that can be run to produce document vectors.

On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <st...@wienert.cc> wrote:

> compared to SVD, is the result the "right singular value"?
>