You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Danny Leshem (JIRA)" <ji...@apache.org> on 2010/04/15 16:22:48 UTC

[jira] Issue Comment Edited: (MAHOUT-369) Issues with DistributedLanczosSolver output

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857329#action_12857329 ] 

Danny Leshem edited comment on MAHOUT-369 at 4/15/10 10:21 AM:
---------------------------------------------------------------

Attached is a simple patch that fixed the two issues raised.

The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).

Jake, I'm doing this mostly to practice sending Mahout patches. If your original code indeed implemented the intended logic, you might want to add a unit-test that fails my patch...

      was (Author: dleshem):
    Attached is a simple patch that fixed the two issues raised.

The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).
  
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira