Posted to dev@mahout.apache.org by "Danny Bickson (JIRA)" <ji...@apache.org> on 2011/02/09 19:54:58 UTC

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992656#comment-12992656 ] 

Danny Bickson commented on MAHOUT-369:
--------------------------------------

I just checked this patch and it is correct. A few other minor problems remain:
1) The eigenvalues were ordered in the opposite direction from the eigenvectors, but the patch fixes this.
2) The signs of the first and third eigenvectors are the opposite of Matlab's. The second eigenvector has the correct sign.
3) When requesting a rank of 4, we get 3 eigenvalues, so it seems that the returned rank is always lower by one.


I have added a test function named testLanczosSolver2() to TestLanczosSolver.java (code below).
To run it, you first need to comment out the following line: //nextVector.assign(new Scale(1 / scaleFactor));
in LanczosSolver.java, so that the results are easier to compare to Matlab, without the normalization.

I further suggest adding an optional flag to skip the normalization.


The factorized matrix is: 
>> full(A)

ans =

    3.1200   -3.1212   -3.0000
   -3.1110    1.5000    2.1212
   -7.0000   -8.0000   -4.0000


The eigenvectors (columns of a) and eigenvalues (diagonal of b) of A'*A are:
>> [a,b]=eig(full(A'*A))  

a =

    0.2132   -0.8010   -0.5593
   -0.5785    0.3578   -0.7330
    0.7873    0.4799   -0.3871


b =

    0.0314         0         0
         0   42.6176         0
         0         0  131.2553

Now I run the unit test testLanczosSolver2 and I get:
INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 0 found with eigenvalue 131.25526355941963
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 1 found with eigenvalue 42.61761063477249
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: LanczosSolver finished.

As you can see, the eigenvalues are correct, but when I look at the eigenvectors I see that
V1 = [ 0.5593 0.7330 0.387 ], whereas in Matlab we get -V1 (the third column of the matrix a
above).

  @Test
  public void testLanczosSolver2() throws Exception {
    int numRows = 3; int numCols = 3;
    int numColumns = 3;
    SparseRowMatrix m = new SparseRowMatrix(new int[]{numRows, numCols});
    /**
     *     3.1200   -3.1212   -3.0000
          -3.1110    1.5000    2.1212
          -7.0000   -8.0000   -4.0000

     */
    m.set(0,0,3.12);
    m.set(0,1,-3.12121);
    m.set(0,2,-3);
    m.set(1,0,-3.111);
    m.set(1,1,1.5);
    m.set(1,2,2.12122);
    m.set(2,0,-7);
    m.set(2,1,-8);
    m.set(2,2,-4);

    int rank = 4;
    Matrix eigens = new DenseMatrix(rank, numColumns);
    long time = timeLanczos(m, eigens, rank, false);
    assertTrue("Lanczos taking too long!  Are you in the debugger? :)", time < 10000);
    //assertOrthonormal(eigens);
    //assertEigen(eigens, m, 0.1, false);
  }
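
As a side note on the sign difference: a flipped sign is not an error by itself, because if v satisfies (A'*A) v = lambda v then so does -v. Below is a minimal, self-contained sketch in plain Java (no Mahout classes; the class and method names are made up for illustration) that checks this for the matrix above, using the Matlab values rounded to four decimals, so the residual is small but not zero.

```java
public class EigenSignCheck {

    /** Returns A' * A for a dense, row-major matrix. */
    static double[][] gram(double[][] a) {
        int n = a[0].length;
        double[][] g = new double[n][n];
        for (double[] row : a) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    g[i][j] += row[i] * row[j];
                }
            }
        }
        return g;
    }

    /** Largest absolute component of (M v - lambda v). */
    static double residual(double[][] m, double[] v, double lambda) {
        double worst = 0.0;
        for (int i = 0; i < m.length; i++) {
            double mv = 0.0;
            for (int j = 0; j < v.length; j++) {
                mv += m[i][j] * v[j];
            }
            worst = Math.max(worst, Math.abs(mv - lambda * v[i]));
        }
        return worst;
    }

    /** Returns -v as a new array. */
    static double[] negate(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = -v[i];
        }
        return out;
    }

    /** True if u matches v or -v component-wise within tol. */
    static boolean equalUpToSign(double[] u, double[] v, double tol) {
        double same = 0.0;
        double flipped = 0.0;
        for (int i = 0; i < u.length; i++) {
            same = Math.max(same, Math.abs(u[i] - v[i]));
            flipped = Math.max(flipped, Math.abs(u[i] + v[i]));
        }
        return same <= tol || flipped <= tol;
    }

    public static void main(String[] args) {
        // Same entries as set in testLanczosSolver2().
        double[][] a = {
            {3.12, -3.12121, -3.0},
            {-3.111, 1.5, 2.12122},
            {-7.0, -8.0, -4.0}
        };
        double[][] g = gram(a);
        double lambda = 131.2553;                    // largest eigenvalue reported by Matlab
        double[] matlabV = {-0.5593, -0.7330, -0.3871}; // third column of a
        double[] flippedV = negate(matlabV);         // sign as seen in the Lanczos output

        System.out.println("residual for  v: " + residual(g, matlabV, lambda));
        System.out.println("residual for -v: " + residual(g, flippedV, lambda));
        System.out.println("equal up to sign: " + equalUpToSign(matlabV, flippedV, 1e-12));
    }
}
```

A comparison along the lines of equalUpToSign() could be used when checking the solver output against Matlab, instead of a strict element-by-element comparison.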


Best, 

Danny Bickson


> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>              Labels: DistributedLanczosSolver, decomposer
>             Fix For: 0.5
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
