Posted to user@mahout.apache.org by Dan Brickley <da...@danbri.org> on 2011/02/25 16:37:18 UTC

Understanding Mahout's Hadoop SVD results - eigenvectors/values vs decomposed matrices

So - it's always a little embarrassing saying "I don't understand",
but here goes. I can't claim to have strong linear algebra skills, and
don't mind admitting that, but I have (hopefully) a rough idea of
what's going on. I've now got to the basic stage where I've at least
put a matrix into the Hadoop SVD/Lanczos implementation (i.e.
https://issues.apache.org/jira/browse/MAHOUT-180) and got something
out again. But then I hit a wall...

My problem is that I was imagining the results would be three factored
matrices (which, when multiplied, would reproduce the original, and from
which I could take the leftmost columns per various SVD tutorials).
Instead, I get:

11/02/25 10:03:11 INFO decomposer.DistributedLanczosSolver: Persisting
10 eigenVectors and eigenValues to: outpath/rawEigenvectors

which when unpacked with
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization-reading.html
gives me

key 0 value: {0:-0.5695508206727358,1:-0.4285601649419706,2:-0.3882489326234163,3:-0.584132531205635}
key 1 value: {0:-0.2721655269759087,2:0.13608276348795434,3:-0.9525793444156804}
key 2 value: {0:0.022062855712982308,1:0.7148365251006306,2:0.6927736693876481,3:0.09266399399452617}
key 3 value: {0:0.783849515338196,1:0.3919247576690979,2:-0.3919247576690983,3:-0.2799462554779274}
key 4 value: {0:0.557690858476082,1:-0.5791405068790082,2:0.5898653310804713,3:-0.0750737694102418}
key 5 value: {0:-0.22447685082502516,1:-0.5158243682948284,2:0.8253228908193994,3:0.04875951606025693}
key 6 value: {0:0.13483997249264842,1:-0.13483997249264842,2:-0.9438798074485389,3:0.26967994498529685}
key 7 value: {0:-0.6758100682735698,1:0.693089266667016,2:0.2503084269657081,3:-0.01592832197702688}
key 8 value: {0:0.4104908741187378,1:0.26436466202790915,2:0.32473155620641553,3:-0.8100357918881508}

....i.e. a single grid of values. Now
http://en.wikipedia.org/wiki/Singular_value_decomposition#Relation_to_eigenvalue_decomposition
and http://www.scribd.com/doc/7017586/Gorrell-Webb tell me that these
are intimately related to the three SVD matrices, but for a novice the
connection isn't entirely clear.

I'll copy details of the specific job / data I tried below, but the
basic issue is, I guess, more one of documentation for tool-oriented
rather than math-oriented users. So consider this a case study in
misunderstanding. If the answer is "you need to (re)learn a bit more
maths", that's a fine outcome. If I get my head around this I'll try
to reflect what I learn back into the wiki.

So I was inspired to dig into SVD after running across a few friendly
tutorials like http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
and I tried to stick with their example for my initial test, also
walking through it in Matlab/Octave.

So my matrix (in Matlab-ese) was taken from their 'Family Guy Seasons
x Users' example, where the elements are specific ratings by users for
seasons:

A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5]
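
For reference, here's roughly what I was expecting to end up with,
following that tutorial in Octave (this is just my own sketch of the
"textbook" decomposition, nothing Mahout-specific):

A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5];
[U, S, V] = svd(A);        % U is 6x6, S is 6x4 (diagonal), V is 4x4
max(max(abs(U*S*V' - A)))  % ~0: the three factors reproduce A
% Keep only the two largest singular values, as in the tutorial:
k = 2;
Uk = U(:, 1:k);            % 6x2
Sk = S(1:k, 1:k);          % 2x2
Vk = V(:, 1:k);            % 4x2
A2 = Uk * Sk * Vk';        % rank-2 approximation of A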

I converted it to Mahout's binary format using the tool at
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
with the following input CSV data:

0,0,5.0
0,1,5.0
0,3,5.0
1,0,5.0
1,2,3.0
1,3,4.0
2,0,3.0
2,1,4.0
2,3,3.0
3,2,5.0
3,3,3.0
4,0,5.0
4,1,4.0
4,2,4.0
4,3,5.0
5,0,5.0
5,1,4.0
5,2,5.0
5,3,5.0

(note that I just skip the zeroed elements; is that appropriate/correct?)
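
As a sanity check on that, here's an Octave snippet (my own, not part
of the conversion tool) confirming that the triples above rebuild A
when the unlisted entries default to zero:

% row, column, value triples from the CSV (0-based), shifted to 1-based:
I = [0 0 0 1 1 1 2 2 2 3 3 4 4 4 4 5 5 5 5] + 1;
J = [0 1 3 0 2 3 0 1 3 2 3 0 1 2 3 0 1 2 3] + 1;
V = [5 5 5 5 3 4 3 4 3 5 3 5 4 4 5 5 4 5 5];
B = full(sparse(I, J, V, 6, 4));   % unspecified entries become 0
A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5];
isequal(A, B)                      % ans = 1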

On the Hadoop cluster I blundered my way into the following:

hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
  --input svdoutput.mht  --output outpath --numRows 6 --numCols 4 --rank 10
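
(One thing I only noticed afterwards: a 6x4 matrix has at most 4
nonzero singular values, so --rank 10 asks for more than this matrix
can really supply; I don't know how Mahout's Lanczos solver handles
that, so corrections welcome. A quick Octave check of the upper bound:)

A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5];
rank(A)     % 4 here: a 6x4 matrix can't do better than 4
svd(A)'     % and there are only 4 singular values to be had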

...which is where I got the values given at the start of this mail.
I've poked around in Octave with
http://www.mathworks.com/help/techdoc/ref/eig.html and
http://www.mathworks.com/help/techdoc/ref/svd.html but I think I've
really hit my limit here.
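
In case it helps anyone reproduce the comparison, this is roughly the
Octave session I ended up with (my own commands, nothing
Mahout-specific). The point, as I understand it, is that the
eigenvectors of A'*A are the right singular vectors of A, and its
eigenvalues are the squared singular values, which is the link between
what eig and svd report:

A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5];
[U, S, V] = svd(A);
[Veig, D] = eig(A' * A);           % A'*A is 4x4 and symmetric
sqrt(sort(diag(D), 'descend'))     % should match diag(S) up to rounding
diag(S)                            % the four singular values of A
% The columns of Veig match the columns of V up to sign and ordering;
% how exactly the rows of rawEigenvectors relate to these (e.g. what
% scaling is applied) is the part I'm still unsure about.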

Thanks for any pointers or other advice,

cheers,

Dan

ps. re wiki documentation, how do you all feel about continuing to use
the example in http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
? Maybe it would be good to have Matlab equivalents in there too?

Re: Understanding Mahout's Hadoop SVD results - eigenvectors/values vs decomposed matrices

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Ted,

If the eigenvectors are premultiplied by the diagonal/eigenvalues before
they're persisted, I guess this is the reason why it's okay for the
eigenvalues not to be persisted separately? From looking at the code,
persistence takes place in two places:

    * DistributedLanczosSolver.serializeOutput(): although the message
      "Persisting X eigenVectors and eigenValues to..." would appear to
      signify that the eigenvalues are persisted, the eigenValues List
      is in fact ignored here.
    * EigenVerificationJob.saveCleanEigens() (when the --cleansvd
      argument has been specified): this persists EigenVector instances,
      which are subclasses of DenseVector. The eigenValue and
      cosAngleError are stored within an EigenVector.name String, and
      some parse methods exist to retrieve them. However, in practice,
      both of these values are ignored, as VectorWritable.writeVector()
      appears to save EigenVectors as DenseVectors.

If it was intended that the eigenvalues should actually be persisted, 
then it looks like there may be a bug here.

Regards,

Derek

On 25/02/11 17:36, Ted Dunning wrote:
> Generally the SVD in these sorts of situations does not return the entire
> set of three matrices.  Instead it returns either the left or the right
> (but usually the right) eigenvectors, premultiplied by the diagonal or by
> the square root of the diagonal elements.
>
> I can't comment specifically on your situation, but hopefully knowing that
> the algorithm isn't supposed to return all three matrices will help you.
>
> On Fri, Feb 25, 2011 at 7:37 AM, Dan Brickley<da...@danbri.org>  wrote:
>
>> My problem is that I was imagining the results would be three factored
>> matrices (which, when multiplied, would reproduce the original, and from
>> which I could take the leftmost columns per various SVD tutorials).
>>

Re: Understanding Mahout's Hadoop SVD results - eigenvectors/values vs decomposed matrices

Posted by Dan Brickley <da...@danbri.org>.
Hi Ted,

On 25 February 2011 18:36, Ted Dunning <te...@gmail.com> wrote:
> Generally the SVD in these sorts of situations does not return the entire
> set of three matrices.  Instead it returns either the left or the right
> (but usually the right) eigenvectors, premultiplied by the diagonal or by
> the square root of the diagonal elements.
> I can't comment specifically on your situation, but hopefully knowing that
> the algorithm isn't supposed to return all three matrices will help you.

That does indeed help! Clearly plenty more to learn... I also found a
JIRA issue reporting some oddities in the Lanczos output (e.g.
returning one less result than requested), so I'm creeping up on
getting it. But there's certainly still a gap; I'd be grateful if
anyone can help bridge things back to
http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
-ish territory, and if we get that far I'll put some time into a
wiki writeup with friendly examples...
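
In the meantime, here's my attempt at restating that description in
Octave terms for the toy matrix above (just my reading of
"eigenvectors premultiplied by the diagonal"; the exact scaling and
orientation Mahout uses may well differ):

A = [5,5,0,5; 5,0,3,4; 3,4,0,3; 0,0,5,3; 5,4,4,5; 5,4,5,5];
[U, S, V] = svd(A);
Sigma = S(1:4, 1:4);             % 4x4 diagonal of singular values
% Each row i of this is sigma_i * v_i', i.e. a right singular vector
% scaled by its singular value, rather than separate U, S and V:
scaled_right = Sigma * V';
% ...or scaled by the square root of the diagonal instead:
scaled_right_sqrt = sqrt(Sigma) * V';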

Dan

Re: Understanding Mahout's Hadoop SVD results - eigenvectors/values vs decomposed matrices

Posted by Ted Dunning <te...@gmail.com>.
Generally the SVD in these sorts of situations does not return the entire
set of three matrices.  Instead it returns either the left or the right
(but usually the right) eigenvectors, premultiplied by the diagonal or by
the square root of the diagonal elements.

I can't comment specifically on your situation, but hopefully knowing that
the algorithm isn't supposed to return all three matrices will help you.

On Fri, Feb 25, 2011 at 7:37 AM, Dan Brickley <da...@danbri.org> wrote:

> My problem is that I was imagining the results would be three factored
> matrices (which, when multiplied, would reproduce the original, and from
> which I could take the leftmost columns per various SVD tutorials).
>