Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2012/09/29 07:11:08 UTC

[jira] [Created] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Jeff Eastman created MAHOUT-1086:
------------------------------------

             Summary: Mean Shift Test Now Produces 4 Clusters
                 Key: MAHOUT-1086
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1086
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.7
            Reporter: Jeff Eastman


Something changed in Mahout around 9/6/12 that caused TestMeanShift.testCanopyEuclideanMRJobNoClustering to return 4 clusters rather than 3. All of the other tests using the same data still return 3 clusters. No changes were made to any of the MeanShiftCanopy classes other than one formatting change to the driver, so I'm at a loss as to the cause.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-1086:
--------------------------------

    Attachment: 0001-MAHOUT-1086-Deal-with-round-off-errors-in-computing-.patch

Here is a patch that deals with the issue as well as related issues. It includes many new tests and increases coverage generally.
                

[jira] [Resolved] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning resolved MAHOUT-1086.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8

Committed this patch.  This will set up MAHOUT-1091 for committing.

                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466480#comment-13466480 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

OK.  I seem to be able to replicate the problem with 

    trunk@1380432 MAHOUT-1059 - Abstract the idea of a cached length
    1cb76f01b9b504fdf33a7ef6e30afdbd7d3842ef

and not before.  This change also introduces some changes to AbstractVector that might be the issue.

The changes involved have to do with whether operations on sparse matrices operated sparsely.  For instance, like()
used to be this:
{code}
    Vector result = like().assign(this);
{code}
This causes a dense iteration which is wrong.  The new code does this instead:
{code}
    Vector result;
    if (isDense()) {
      result = like().assign(this);
    } else {
      result = like();
      Iterator<Element> i = this.iterateNonZero();
      while (i.hasNext()) {
        final Element element = i.next();
        result.setQuick(element.index(), element.get());
      }
    }
{code}
The idea is that if the source of the data is sparse, we only need to assign the non-zero elements since we know the newly minted destination will be zero filled.
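
The sparse-copy idea can be sketched outside Mahout as follows. This is only an illustration under stated assumptions: a `TreeMap` stands in for a sparse vector, and `SparseCopySketch` and `sparseCopy` are hypothetical names, not part of Mahout's Vector API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector: index -> non-zero value.
// This is NOT Mahout's Vector API, just an illustration of the idea.
class SparseCopySketch {

  /** Copies only the stored (non-zero) entries into a fresh, empty container. */
  static Map<Integer, Double> sparseCopy(Map<Integer, Double> source) {
    // Plays the role of like(): an empty result that is implicitly all zeros.
    Map<Integer, Double> result = new TreeMap<>();
    for (Map.Entry<Integer, Double> e : source.entrySet()) {
      // Analogous to result.setQuick(element.index(), element.get()).
      result.put(e.getKey(), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    int cardinality = 1_000_000; // logical size of the vector
    Map<Integer, Double> v = new TreeMap<>();
    v.put(3, 1.5);
    v.put(999_999, -2.0);

    Map<Integer, Double> copy = sparseCopy(v);
    System.out.println("entries visited: " + copy.size() + " of " + cardinality);
    System.out.println("copy equals source: " + copy.equals(v));
  }
}
```

The work done is proportional to the number of stored entries rather than to the logical size of the vector, which is the point of avoiding the dense iteration.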

My feeling is that this code is correct, but there is a more complex change later in the same diff that might have changed some results.

I will isolate these changes and see if I can determine what the changes were and how they impact canopy stuff.  
                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466361#comment-13466361 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

Hmm... looks like my test script wasn't installing math so it was artificially failing.

More anon.
                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466360#comment-13466360 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

I think that this is my issue.  According to git bisect, the first bad commit was svn@1379763:

{code}
9929fddbc542d43f7978cb45980ae3e8855b2278 is the first bad commit
commit 9929fddbc542d43f7978cb45980ae3e8855b2278
Author: Ted Dunning <td...@apache.org>
Date:   Sat Sep 1 14:03:22 2012 +0000

    MAHOUT-1063 - Integer and real attributes are handled just as any numeric attribute.
    
    git-svn-id: https://svn.apache.org/repos/asf/mahout/trunk@1379763 13f79535-47bb-0310-9956-ffa450edef68

:040000 040000 199b69ff766daded3774d0e8fa64cf59199e9af1 12e47b4f7f35512604c457c2dd18e40d8d0ae34a M	integration
bisect run success
Teds-MacBook-Pro-2:mahout-2[(no branch)*]$ git bisect log
git bisect start
# bad: [fa30e5fcb8f2ed8002a6676a673494273f63e679] MAHOUT-1059 - Remove memory hungry test
git bisect bad fa30e5fcb8f2ed8002a6676a673494273f63e679
# good: [556737f074a0ad97595a6b584e99bd020c9d8b23] more getters for a factorization (minor change)
git bisect good 556737f074a0ad97595a6b584e99bd020c9d8b23
# bad: [c91eba1ad197de6f61a010b880b6cfed671051d9] MAHOUT-1059 - Add generic vector test.
git bisect bad c91eba1ad197de6f61a010b880b6cfed671051d9
# bad: [ab514d84bfc4cb0c34b5a79d63d08df5f742dba0] MAHOUT-1063 - Add test case for ARFF integers and reals.
git bisect bad ab514d84bfc4cb0c34b5a79d63d08df5f742dba0
# bad: [9929fddbc542d43f7978cb45980ae3e8855b2278] MAHOUT-1063 - Integer and real attributes are handled just as any numeric attribute.
git bisect bad 9929fddbc542d43f7978cb45980ae3e8855b2278
Teds-MacBook-Pro-2:mahout-2[(no branch)*]$ 
{code}

I will look further.
                

[jira] [Updated] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-1086:
--------------------------------

    Attachment: 0001-MAHOUT-1086-Deal-with-round-off-errors-in-computing-.patch

Here is a patch in git format.  I can't remember if Jenkins will apply that cleanly or not.

The problem was round-off error, which comes out differently when the order of operations is different.  Changing the caching changes that order, and that is how this problem came up.
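
For readers unfamiliar with the effect, here is a minimal, self-contained sketch of how the order of floating-point additions alone can change a result. The class and method names are hypothetical and the code is not taken from Mahout; it only illustrates that double addition is not associative.

```java
// Illustration only: summing the same values in a different order
// can give a different double result.
class RoundOffOrderSketch {

  /** Adds the values from first to last. */
  static double sumForward(double[] xs) {
    double s = 0;
    for (double x : xs) {
      s += x;
    }
    return s;
  }

  /** Adds the same values from last to first. */
  static double sumBackward(double[] xs) {
    double s = 0;
    for (int i = xs.length - 1; i >= 0; i--) {
      s += xs[i];
    }
    return s;
  }

  public static void main(String[] args) {
    double[] xs = {0.1, 0.2, 0.3};
    double forward = sumForward(xs);
    double backward = sumBackward(xs);
    System.out.println("forward  = " + forward);
    System.out.println("backward = " + backward);
    System.out.println("identical: " + (forward == backward));
  }
}
```

The two sums differ only in the last bit or so, but a clustering step that compares a distance against a threshold can still go one way or the other because of it.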

The expected result of this patch is that the old error will come back (reversed).  That is, you should see
{code}
Failed tests:   testCanopyEuclideanMRJobNoClustering(org.apache.mahout.clustering.meanshift.TestMeanShift): count expected:<4> but was:<3>
{code}

                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466500#comment-13466500 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

OK. I think I have the problem isolated even if I don't understand it.  In getDistanceSquared, I separate out the computation of one operand's squared length and push it back into that operand for caching.  The code is
{code}
    Iterator<Element> it = sparseAccessed.iterateNonZero();
    double d = randomlyAccessed.getLengthSquared();
    double d2 = 0;
    double dot = 0;
    while (it.hasNext()) {
      Element e = it.next();
      double value = e.get();
      d2 += value * value;
      dot += value * randomlyAccessed.getQuick(e.index());
    }
    //assert d > -1.0e-9; // round-off errors should never be too far off!
    final double r1 = Math.abs(d + d2 - 2 * dot);
    final double r2 = oldDistanceSquared(v);
    final double error = Math.abs(r1 - r2) / r1;
    if (error > 1e-14) {
      System.err.printf("Discrepancy %.3f\n", error);
    }
//    if (sparseAccessed instanceof LengthCachingVector) {
//      ((LengthCachingVector) sparseAccessed).setLengthSquared(d2);
//    }
    return r2;
{code}
The commented code is where the cache is updated.  If these lines are commented, the problem does not happen.  If these lines are uncommented, it does happen.  
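
As a standalone illustration of why the `d + d2 - 2 * dot` form can disagree with a directly computed distance, the sketch below compares the two on a pair of nearly equal vectors. It uses plain arrays rather than Mahout vectors, and the names `expanded` and `direct` are assumptions made for this example only.

```java
// Illustration only: two algebraically equal ways to compute |a - b|^2.
class DistanceSquaredSketch {

  /** Expanded form |a|^2 + |b|^2 - 2 a.b; the squared lengths are what a cache would hold. */
  static double expanded(double[] a, double[] b) {
    double a2 = 0;
    double b2 = 0;
    double dot = 0;
    for (int i = 0; i < a.length; i++) {
      a2 += a[i] * a[i];
      b2 += b[i] * b[i];
      dot += a[i] * b[i];
    }
    return a2 + b2 - 2 * dot;
  }

  /** Direct form: sum of squared component differences. */
  static double direct(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two large, nearly equal vectors; the exact squared distance is 1.
    double[] a = {1e8, 0.0};
    double[] b = {1e8 + 1, 0.0};
    // b[0] * b[0] is not exactly representable as a double, so the
    // subtraction in the expanded form cancels away the true answer.
    System.out.println("expanded = " + expanded(a, b));
    System.out.println("direct   = " + direct(a, b));
  }
}
```

The expanded form subtracts quantities of order 1e16 to recover a result of order 1, so any rounding in the large terms dominates the answer. The direct form subtracts first and squares afterwards, which avoids the cancellation.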

My problem here is that I can't yet understand what the problem is.  I also don't understand how this is different from what we had before.  I have also put a test into the place where the cache is updated and don't see that saving this value causes a problem.

I think that we have a problem where some other code somewhere is misusing this cache.  I am going to start a wide-ranging inspection to see what is going on.  That is going to take quite a while, especially since I am unlikely to have another full day to beat on this for a while.
                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466440#comment-13466440 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

OK.  So this is a bit of a Heisenbug.  I am not seeing consistent failures.

I am about to run again with multiple replications so that success means five clean runs or some such.

Thank goodness bisection is "fast".
                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473820#comment-13473820 ] 

Hudson commented on MAHOUT-1086:
--------------------------------

Integrated in Mahout-Quality #1695 (See [https://builds.apache.org/job/Mahout-Quality/1695/])
    MAHOUT-1086 - Deal with round-off errors in computing L_2 distances.  Add special case to get higher accuracy when vector difference is small, merge AbstractVectorTest and AbstractTestVector, fix like() bug in Centroid and WeightedVector. (Revision 1396888)

     Result = SUCCESS
tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396888
Files : 
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/Centroid.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/DelegatingVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/LengthCachingVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/WeightedVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/AbstractTestVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/CentroidTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestDenseVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestRandomAccessSparseVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSequentialAccessSparseVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/VectorTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/WeightedVectorTest.java
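
The commit message mentions a special case for higher accuracy when the vector difference is small. One way such a guard could look is sketched below. This is an assumption-laden illustration, not the committed code: the class name, the method name and the `RELATIVE_CUTOFF` value are all hypothetical, and plain arrays stand in for Mahout vectors.

```java
// Illustration only: fall back to the direct computation when the
// expanded form is likely to be dominated by round-off.
class GuardedDistanceSketch {

  // Hypothetical cutoff chosen for this example; not taken from the patch.
  static final double RELATIVE_CUTOFF = 1e-6;

  static double distanceSquared(double[] a, double[] b) {
    double a2 = 0;
    double b2 = 0;
    double dot = 0;
    for (int i = 0; i < a.length; i++) {
      a2 += a[i] * a[i];
      b2 += b[i] * b[i];
      dot += a[i] * b[i];
    }
    double expanded = a2 + b2 - 2 * dot;
    // The expanded result is trusted only when it is not tiny
    // compared with the squared lengths it was computed from.
    if (expanded > RELATIVE_CUTOFF * (a2 + b2)) {
      return expanded;
    }
    // Otherwise recompute directly from the component differences.
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  public static void main(String[] args) {
    double[] a = {1e8, 0.0};
    double[] b = {1e8 + 1, 0.0};
    System.out.println("guarded distance squared = " + distanceSquared(a, b));
  }
}
```

The trade-off in a sketch like this is that well-separated vectors keep the cheap, cache-friendly path, while nearly identical vectors pay for a second pass in exchange for accuracy.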

                

[jira] [Commented] (MAHOUT-1086) Mean Shift Test Now Produces 4 Clusters

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466654#comment-13466654 ] 

Ted Dunning commented on MAHOUT-1086:
-------------------------------------

OK.  The problem was round-off errors introduced by the new formulation.  The caching of squared lengths apparently changed which round-off errors occurred which triggered a change in the (excessively) sensitive clustering tests.

Patch coming shortly.
                