You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Danny Leshem (JIRA)" <ji...@apache.org> on 2010/04/07 15:31:33 UTC

[jira] Created: (MAHOUT-369) Issues with DistributedLanczosSolver output

Issues with DistributedLanczosSolver output
-------------------------------------------

                 Key: MAHOUT-369
                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
             Project: Mahout
          Issue Type: Bug
          Components: Math
    Affects Versions: 0.3, 0.4
            Reporter: Danny Leshem


DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
{code}
    log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
{code}

However, a few lines later (line 106) we have
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
        ...
    }
{code}

which only persists eigenVectors.numRows()-1 vectors.

Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?


Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

This, for two reasons:
1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garabage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Danny Leshem (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Leshem updated MAHOUT-369:
--------------------------------

    Fix Version/s: 0.4

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Danny Leshem (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857329#action_12857329 ] 

Danny Leshem edited comment on MAHOUT-369 at 4/15/10 10:21 AM:
---------------------------------------------------------------

Attached is a simple patch that fixed the two issues raised.

The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).

Jake, I'm doing this mostly to practice sending Mahout patches. If your original code indeed implemented the intended logic, you might want to add a unit-test that fails my patch...

      was (Author: dleshem):
    Attached is a simple patch that fixed the two issues raised.

The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).
  
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Danny Leshem (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Leshem updated MAHOUT-369:
--------------------------------

    Description: 
DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
{code}
    log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
{code}

However, a few lines later (line 106) we have
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
        ...
    }
{code}

which only persists eigenVectors.numRows()-1 vectors.

Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?


Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

This, for two reasons:
1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

  was:
DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
{code}
    log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
{code}

However, a few lines later (line 106) we have
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
        ...
    }
{code}

which only persists eigenVectors.numRows()-1 vectors.

Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?


Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

This, for two reasons:
1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garabage.


> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Danny Leshem (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Leshem updated MAHOUT-369:
--------------------------------

    Attachment: MAHOUT-369.patch

Attached is a simple patch that fixed the two issues raised.

The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Danny Leshem (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859855#action_12859855 ] 

Danny Leshem commented on MAHOUT-369:
-------------------------------------

Okay, please ignore my previous patch.

I think I finally understand what's going on with the code. There are two seemingly unrelated issues:

1) The code returns one less eigenvector/eigenvalue than requested.

LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested-
{code}
    for (int i = 0; i < basis.numRows() - 1; i++) {
{code}

DistributedLanczosSolver.java (106) serializes one less eigenvector than requested-
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
{code}

2) When asked for N eigenvectors, the code returns (serializes) the most important N-1 eigenvectors in *descending* order (meaning, important ones are serialized first - which, other than the "N-1" part, is excellent!), but associates them with the bottom N-1 out of the top N most important eigenvalues in *ascending* order (meaning, important ones are serialized last).

In other words, the #1 important eigenvalue is not serialized at all, the #2 important eigenvalue is associated with the #N-1 important eigenvector, ... and finally the #N most important eigenvalue is associated with the #1 important eigenvector.

Vector names changed drastically during the last few days (following Sean's patch for MAHOUT-379), so issue #2 is less obvious now. Previously the eigenvalues where serialized with the vector. Now they are only mentioned in log prints.

Here's some code to verify these claims. The code creates a 1000-rows 100-columns matrix. For each row, the first element is +1/-1 with equal probabilities, and the rest are +0.001/-0.001 with equal probabilities. The Elements are uncorrelated. So decomposing this matrix should reveal that the most important PC is the 100-dimensional vector (1, 0, 0, ... , 0), and it should be associated with a *much* higher eigenvalue than the rest. To test my claims, remove the "-1" from the two loops mentioned above, and run the following code:

[add to TestDistributedRowMatrix.java]
{code}
  public static DistributedRowMatrix distributedMatrix(final Matrix matrix,
                                                       String baseTmpDir) throws IOException {
    baseTmpDir = TESTDATA + baseTmpDir;
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    ClusteringTestUtils.writePointsToFile(new Iterable<VectorWritable>() {
      @Override
      public Iterator<VectorWritable> iterator() {
        final Iterator<MatrixSlice> it = matrix.iterator();
        return new Iterator<VectorWritable>() {
          @Override
          public boolean hasNext() { return it.hasNext(); }
          @Override
          public VectorWritable next() {
            MatrixSlice slice = it.next();
            return new VectorWritable(slice.vector());
          }
          @Override
          public void remove() { it.remove(); }
        };
      }
    }, true, baseTmpDir + "/distMatrix/part-00000", fs, conf);

    DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir + "/distMatrix",
                                                               baseTmpDir + "/tmpOut",
                                                               matrix.numRows(),
                                                               matrix.numCols());
    distMatrix.configure(new JobConf(conf));

    return distMatrix;
  }
{code}

[add to TestDistributedLanczosSolver.java, and run the test]
{code}
  private Random rand = new Random();

  private double[] randomRow(int numColumns) {
      final double[] values = new double[numColumns];

      // Only the first column's value is "important". Columns are uncorrelated.
      values[0] = (rand.nextBoolean() ? 1 : -1);
      for (int i = 1; i < numColumns; ++i) {
          values[i] = (rand.nextBoolean() ? 0.001 : -0.001);
      }

      return values;
  }

  private Matrix randomMatrix(int numRows, int numColumns) {
      final Matrix matrix = new DenseMatrix(numRows, numColumns);
      for (int row = 0; row < numRows; ++row) {
          matrix.set(row, randomRow(numColumns));
      }
      return matrix;
  }

  public void testDistributedLanczosSolverSanity() throws Exception {
    final File testData = new File("testdata");
    if (!testData.exists()) {
      testData.mkdir();
    }

    final Matrix matrix = randomMatrix(1000, 100);
    final DistributedRowMatrix distMatrix =
            TestDistributedRowMatrix.distributedMatrix(matrix, "testdata");
    distMatrix.configure(new JobConf());

    final DistributedLanczosSolver solver = new DistributedLanczosSolver();
    final Matrix eigenVectors = new DenseMatrix(30, 100);
    final List<Double> eigenValues = new ArrayList<Double>();
    solver.solve(distMatrix, 30, eigenVectors, eigenValues, false);

    System.out.println("--- Eigenvalues ---");
    printDoubleList(eigenValues);
    
    System.out.println("--- Eigenvectors ---");
    printMatrix(eigenVectors);
  }

  private void printDoubleList(List<Double> values) {
    for (double value : values) {
        System.out.print(value + "\t");
    }
    System.out.println("");
  }

  private void printMatrix(Matrix matrix) {
      for (int row = 0; row < matrix.numRows(); ++row) {
          for (int col = 0; col < matrix.numCols(); ++col) {
              System.out.print(matrix.get(row, col) + "\t");
          }
          System.out.println("");
      }
  }
{code}

The test does not fail the current code, but you can see what's wrong by its prints. To make this a real test, one can replace the printing with verifying that 1) the first eigenvalue is, say, at least 100 times bigger than the second; and 2) the returned eigenvectors in fact correspond to their eigenvalues. This would fail the current code.

I considered submitting another patch, but I'm no longer sure about the best way to fix this. However, I am positive now that this is a real issue that needs to be fixed.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860690#action_12860690 ] 

Jake Mannix commented on MAHOUT-369:
------------------------------------

Danny, thanks for looking into this so carefully, I think you definitely have described what's going on in this now.  

I'm going to try and look at this some more in the next few days and clean it up a bit, and add in some of your suggested unit tests.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855014#action_12855014 ] 

Jake Mannix commented on MAHOUT-369:
------------------------------------

Hold on that Sean, I made the loop like that for a reason.  I need to check in again and verify if/where it's wrong, but it was not an oversight, it has to do with the way the Colt code does EigenDecomposition.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix reassigned MAHOUT-369:
----------------------------------

    Assignee: Jake Mannix

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854992#action_12854992 ] 

Sean Owen commented on MAHOUT-369:
----------------------------------

If the patch amounts to making that loop into:

   for (int i = eigenVectors.numRows() - 1; i >= 0; i--) {

Then I can commit that. But do the folks who understand this code agree it's the thing to do? seems so.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854971#action_12854971 ] 

Ted Dunning commented on MAHOUT-369:
------------------------------------

Can you create a suggested patch?

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.