You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2014/07/03 19:11:22 UTC

[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/1293

    [SPARK-2355] Add checker for the number of clusters

    When the number of clusters given to perform with org.apache.spark.mllib.clustering.KMeans under parallel initial mode is greater than data number, it will throw ArrayIndexOutOfBoundsException.
    
    This PR adds checker for the number of clusters and throws IllegalArgumentException when that number is greater than data number.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 check_clusters_number

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1293
    
----
commit 582cd11e5331a8e2704a5603080eec41c9002cf4
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2014-07-03T16:27:22Z

    simply add checker for the number of clusters.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/1293


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-47959115
  
    data.count() is actually a very expensive operation, as it has to scan all the data. If cached, it may not be as much a problem, but it is still probably not worth it for this check. Which part throws the exception?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53340876
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53944261
  
    @viirya Do you mind closing this JIRA for now? We should handle this corner case more gracefully. Instead of throwing an exception, we could output less than k cluster centers. SPARK-3218 was created to track this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-50691878
  
    @viirya could you add `[MLlib]` to the title of this pull request? Otherwise it doesn't get sorted correctly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-47958358
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53341142
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19157/consoleFull) for   PR 1293 at commit [`582cd11`](https://github.com/apache/spark/commit/582cd11e5331a8e2704a5603080eec41c9002cf4).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-47961640
  
    The problem lies in `initKMeansParallel`, the implementation of k-means|| algorithm. Since it selects at most the centers as many as the data number, when calling `LocalKMeans.kMeansPlusPlus` at the end of `initKMeansParallel`, `kMeansPlusPlus` would throw this exception.
    
    I can slightly modify `kMeansPlusPlus` to avoid this exception by selected chosen centers to fill the gap between cluster numbers and data number. But this approach might not be appropriate because it is not the problem of the algorithm.
    
    I also think about whether it is worth to check that by scanning all data. But since it is only counting and no other computations involved, it might be acceptable still. In fact, there are also many map operations on the data later in clustering. Comparing with these map ops, `data.count()` should be lightweight more? Or it is unnecessary to check that? Any suggestions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53945559
  
    @mengxr OK. Thanks for informing. SPARK-3218 looks promising.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53340870
  
    @mengxr


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2355][MLlib] Add checker for the number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-53347182
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19157/consoleFull) for   PR 1293 at commit [`582cd11`](https://github.com/apache/spark/commit/582cd11e5331a8e2704a5603080eec41c9002cf4).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org