Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/09/30 09:34:35 UTC
[jira] Updated: (MAHOUT-369) Issues with DistributedLanczosSolver output
[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-369:
-----------------------------
Fix Version/s: 0.5
(was: 0.4)
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
> Key: MAHOUT-369
> URL: https://issues.apache.org/jira/browse/MAHOUT-369
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3, 0.4
> Reporter: Danny Leshem
> Assignee: Jake Mannix
> Fix For: 0.5
>
> Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
> ...
> }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> It seems that the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted. Is this an off-by-one bug?
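
The loop bound quoted above can be illustrated in isolation. This is only a minimal sketch: `LoopBoundDemo` and `indicesVisited` are hypothetical names, no Mahout classes are used, and the sketch does not model which eigenvector the solver stores in which row. It only shows that a bound of `numRows - 1` never reaches the last row index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo; it does not use any Mahout classes.
class LoopBoundDemo {

  // Returns the row indices visited by a loop of the form
  // for (int i = 0; i < bound; i++), where bound is either
  // numRows - 1 (as in the quoted loop) or numRows.
  static List<Integer> indicesVisited(int numRows, boolean subtractOne) {
    int bound = subtractOne ? numRows - 1 : numRows;
    List<Integer> visited = new ArrayList<>();
    for (int i = 0; i < bound; i++) {
      visited.add(i);
    }
    return visited;
  }

  public static void main(String[] args) {
    int numRows = 4; // assumed row count, for illustration only
    System.out.println("i < numRows - 1 visits " + indicesVisited(numRows, true));
    System.out.println("i < numRows     visits " + indicesVisited(numRows, false));
  }
}
```

With four rows, the first bound visits three indices and the second visits all four, which matches the mismatch between the log message and the number of vectors actually written.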
> Also, I think it would be better to persist the eigenvectors in *reverse* order, so that the most significant vector is marked "0", the second most significant is marked "1", and so on.
> There are two reasons for this:
> 1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
> 2) It makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
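
The proposed ordering can be sketched as follows. This is an illustration under stated assumptions, not the solver's actual persistence code: `SignificanceOrderDemo`, `bySignificance` and `keepTop` are hypothetical names, and the eigenvalues are made-up numbers. The idea is that once index 0 refers to the largest eigenvalue, discarding the least significant components is just taking a prefix.

```java
import java.util.Arrays;

// Hypothetical stand-alone demo; it does not use any Mahout classes.
class SignificanceOrderDemo {

  // Returns row indices ordered by descending eigenvalue, so that
  // position 0 refers to the most significant eigenvector.
  static int[] bySignificance(double[] eigenvalues) {
    Integer[] order = new Integer[eigenvalues.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Compare b against a to sort from largest to smallest eigenvalue.
    Arrays.sort(order, (a, b) -> Double.compare(eigenvalues[b], eigenvalues[a]));
    int[] result = new int[order.length];
    for (int i = 0; i < order.length; i++) {
      result[i] = order[i];
    }
    return result;
  }

  // Keeps only the k most significant entries; with this ordering
  // that is simply the first k positions.
  static int[] keepTop(int[] order, int k) {
    return Arrays.copyOf(order, Math.min(k, order.length));
  }

  public static void main(String[] args) {
    double[] eigenvalues = {0.1, 2.5, 9.0, 0.7}; // made-up values
    int[] order = bySignificance(eigenvalues);
    System.out.println("persist order: " + Arrays.toString(order));
    System.out.println("top 2:         " + Arrays.toString(keepTop(order, 2)));
  }
}
```

Because rank 0 always denotes the largest eigenvalue, a second run that requests more components would keep the same leading ranks, which is what makes the comparison in point 1 straightforward.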