You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/07/14 19:56:47 UTC

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1407

    SPARK-1215: Clustering: Index out of bounds error

    Bug fix for JIRA SPARK 1215: Clustering: Index out of bounds error
    
     https://issues.apache.org/jira/browse/SPARK-1215
    
    Solution: Print warning, and use duplicate cluster centers so that exactly k centers are returned.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1407
    
----
commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6
Author: Joseph Bradley <jo...@gmail.com>
Date:   2014-05-20T01:35:53Z

    added RDD::stratifiedSample method and associated unit tests in RDDSuite.  Method is built off of RDD::takeSample method.

commit 91e83338820158b96cda492668dbed5fff33f19b
Author: Joseph Bradley <jo...@gmail.com>
Date:   2014-05-20T01:36:30Z

    added RDD::stratifiedSample method documentation

commit d6f8913b7e370a82138b9c623754b32a59c21cf6
Author: Joseph Bradley <jo...@gmail.com>
Date:   2014-05-21T04:58:12Z

    updated stratifiedSample to be more scalable, keeping data in RDDs instead of collecting to the driver

commit 21eead6a412508b536358f4e557e2fab23c9c696
Author: Joseph Bradley <jo...@gmail.com>
Date:   2014-05-23T19:51:07Z

    updated stratifiedSample to use selection-rejection to select samples on each partition in 1 pass, rather than pre-selecting indices

commit 91f4b19702bc58a77d28316674eace881a81165f
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-09T21:37:25Z

    merging with new spark

commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-11T18:43:17Z

    merging with updated spark

commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-11T18:47:41Z

    removed my coding test updates

commit 18e5c8ad740871be92c6d7b73f5d35e25641a734
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-12T01:12:44Z

    Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.

commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-14T17:52:33Z

    Merge remote-tracking branch 'upstream/master'

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1407#discussion_r14926012
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
    @@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
         assert(model.clusterCenters.head === center)
       }
     
    +  test("no distinct points") {
    +    val data = sc.parallelize(Array(
    +      Vectors.dense(1.0, 2.0, 3.0),
    +      Vectors.dense(1.0, 2.0, 3.0),
    +      Vectors.dense(1.0, 2.0, 3.0)
    +    ))
    --- End diff --
    
    add `, 2` to test two partitions


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49085926
  
    QA tests have started for PR 1407. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49347163
  
    QA results for PR 1407:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1407#discussion_r15073983
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
    @@ -59,6 +59,11 @@ private[mllib] object LocalKMeans extends Logging {
             cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
             j += 1
           }
    +      if (j == 0) {
    +        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
    +          s" Using duplicate point for center k = $i.")
    +        j = 1
    --- End diff --
    
    I'll go with the second suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley closed the pull request at:

    https://github.com/apache/spark/pull/1407


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-48949060
  
    QA tests have started for PR 1407. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49347076
  
    QA tests have started for PR 1407. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-48960591
  
    QA results for PR 1407:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1407#discussion_r14926024
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
    @@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
         assert(model.clusterCenters.head === center)
       }
     
    +  test("no distinct points") {
    +    val data = sc.parallelize(Array(
    +      Vectors.dense(1.0, 2.0, 3.0),
    +      Vectors.dense(1.0, 2.0, 3.0),
    +      Vectors.dense(1.0, 2.0, 3.0)
    +    ))
    +    val center = Vectors.dense(1.0, 2.0, 3.0)
    +
    +    // Make sure code runs.
    +    var model = KMeans.train(data, k=2, maxIterations=1)
    +    assert(model.clusterCenters.size === 2)
    +  }
    +
    +  test("more clusters than points") {
    +    val data = sc.parallelize(Array(
    +      Vectors.dense(1.0, 2.0, 3.0),
    +      Vectors.dense(1.0, 3.0, 4.0)
    +    ))
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49098158
  
    QA results for PR 1407:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49012803
  
    @jkbradley The fix looks good to me except some minor style issues. Thanks for fixing it! Btw, please add `[MLLIB]` to the title so this is easy to find.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215 [MLLIB]: Clustering: Index out of b...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49352230
  
    I tangled stuff in this PR, so I am closing it and resubmitting (with updates per mengxr's suggestions) as PR 1468: https://github.com/apache/spark/pull/1468


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-48934486
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-48948316
  
    Jenkins, add to whitelist and test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49124516
  
    QA tests have started for PR 1407. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1407#discussion_r14925960
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
    @@ -59,6 +59,11 @@ private[mllib] object LocalKMeans extends Logging {
             cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
             j += 1
           }
    +      if (j == 0) {
    +        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
    +          s" Using duplicate point for center k = $i.")
    +        j = 1
    --- End diff --
    
    The code may be clearer if written in this way
    
    ~~~
    centers(i) = 
      if (j == 0) {
        logWarning("...")
        points(0).toDense
      } else {
        points(j - 1).toDense
      }
    ~~~
    
    or 
    
    ~~~
    if (j == 0) {
      logWarning("...")
      centers(i) = points(0).toDense
    } else {
      centers(i) = points(j - 1).toDense
    }
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1407#issuecomment-49130099
  
    QA results for PR 1407:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---