Posted to reviews@spark.apache.org by HuJiayin <gi...@git.apache.org> on 2015/09/01 05:03:45 UTC

[GitHub] spark pull request: SPARK-10329

GitHub user HuJiayin opened a pull request:

    https://github.com/apache/spark/pull/8546

    SPARK-10329

    K-means|| finds centers more efficiently, since it is based on stochastic processes. However, users with small executor memory can have difficulty running it. This patch falls back to the old method in that case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HuJiayin/spark json

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8546
    
----
commit ff29696106453b62c6d15d0ba35b8c55282f1c70
Author: HuJiayin <ji...@intel.com>
Date:   2015-09-01T02:57:05Z

    SPARK-10329

----



[GitHub] spark pull request: SPARK-10329

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136566441
  
    cc @mengxr 



[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-141347838
  
    @HuJiayin This basically reverts the behavior back to 1.2. The change we made in 1.3 avoids recomputing distances between old centers and input points during initialization; that is why we need `newCenters`. If you test the current version with a large `k`, you will see the performance difference. Based on our discussion offline, I think there is not much work to do here. The case where the new implementation introduces overhead is when the dataset is really tall and skinny, but we haven't heard negative feedback from practical use cases yet. Do you mind closing this PR for now? Thanks!
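
A minimal sketch of the optimization described above, assuming a costs RDD that holds each point's squared distance to its closest center so far (the function and its signature are illustrative, not the code shipped in KMeans.scala):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Each k-means|| round only measures distances to the centers added in
    // the previous round (newCenters); distances to older centers are
    // already folded into the running per-point cost.
    def updateCosts(
        data: RDD[Vector],
        costs: RDD[Double],
        newCenters: Array[Vector]): RDD[Double] = {
      data.zip(costs).map { case (point, cost) =>
        newCenters.foldLeft(cost) { (c, center) =>
          math.min(c, Vectors.sqdist(point, center))
        }
      }
    }

With a large `k`, comparing each point against only the latest round's centers rather than all accumulated centers is what saves the repeated distance computations.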



[GitHub] spark pull request: SPARK-10329

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8546#discussion_r38401804
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -246,10 +248,23 @@ class KMeans private (
             if (initializationMode == KMeans.RANDOM) {
               initRandom(data)
             } else {
    -          initKMeansParallel(data)
    +          try {
    +            val exeMemSize = sc.getConf.get("spark.executor.memory")
    +            val memDim = exeMemSize.charAt(exeMemSize.length - 1)
    +            if ((memDim.equals('g') || memDim.equals('G')) && exeMemSize.length < 3 ||
    --- End diff --
    
    Eh, can you explain what this is? I don't think this even handles all cases.
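
For comparison, one alternative sketch (not part of this PR): SparkConf already knows how to parse size strings such as "512m" or "8g", so the character-by-character suffix inspection could be replaced with its built-in getter. The 10 GB cutoff below is only an assumption drawn from this discussion.

    // Sketch: let SparkConf parse the size string instead of checking the
    // last character by hand. getSizeAsMb understands k/m/g/t suffixes and
    // throws NumberFormatException on malformed values.
    val exeMemMb: Long = sc.getConf.getSizeAsMb("spark.executor.memory", "1g")
    val useFallback = exeMemMb < 10 * 1024  // hypothetical threshold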



[GitHub] spark pull request: SPARK-10329

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136652392
  
    I personally think it's brittle to arbitrarily change behavior based on one executor's memory size. It seems like this suggests you can only use the current implementation with >= 10g of memory, which doesn't sound right.



[GitHub] spark pull request: SPARK-10329

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-138047275
  
    Users don't need to enlarge memory to run 1.5 k-means after applying this fix. They can still use their 1.2 configuration and get a stable run at the same time.



[GitHub] spark pull request: SPARK-10329

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8546#discussion_r38401811
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -246,10 +248,23 @@ class KMeans private (
             if (initializationMode == KMeans.RANDOM) {
               initRandom(data)
             } else {
    -          initKMeansParallel(data)
    +          try {
    +            val exeMemSize = sc.getConf.get("spark.executor.memory")
    +            val memDim = exeMemSize.charAt(exeMemSize.length - 1)
    +            if ((memDim.equals('g') || memDim.equals('G')) && exeMemSize.length < 3 ||
    +                (memDim.equals('m') || memDim.equals('M')) && exeMemSize.length < 6 ||
    +                (memDim.equals('k') || memDim.equals('K')) && exeMemSize.length < 9) {
    +              initKMeansParallelFallBack(data)
    +            } else {
    +              initKMeansParallel(data)
    +            }
    +          } catch {
    +            case _: NoSuchElementException => initKMeansParallel(data)
    --- End diff --
    
    Why this?
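
The NoSuchElementException arises because SparkConf.get(key) throws when the key is unset; a sketch of the more conventional pattern (with isSmall standing in for whatever size check is wanted) avoids the catch entirely:

    // Sketch: getOption returns None when spark.executor.memory is unset,
    // so no try/catch around SparkConf.get is needed.
    sc.getConf.getOption("spark.executor.memory") match {
      case Some(mem) if isSmall(mem) => initKMeansParallelFallBack(data)
      case _ => initKMeansParallel(data)
    }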



[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin closed the pull request at:

    https://github.com/apache/spark/pull/8546



[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-140959166
  
    @mengxr I reduced the centers' storage and deleted the fallback and duplicated code. I tested the functionality and performance locally and it works. Could you give me a code-format template for IntelliJ? I found my code ends up with many stray blank spaces after reformatting.



[GitHub] spark pull request: SPARK-10329

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136911191
  
    Users can manually adjust memory when they hit a failure, but it may take them some time to find the root cause. There are many ways to implement k-means, and they work well in different cases. I think it may be better if the switch is transparent to users, with the code choosing between the different approaches automatically.



[GitHub] spark pull request: SPARK-10329

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136561196
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-10329

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136908702
  
    K-means|| is a better algorithm for finding the centers: if the user has sufficient memory, the performance is better. But sometimes, depending on k-means parameters such as k, the sample settings, and the use case, the algorithm induces a lot of duplicated computation and causes large memory consumption on each node.
    
    I tried 8G of executor memory on each node with 10 slaves in yarn mode (dimension 20, first k=5 and then k=50000, with samples=1.2 billion kept the same). The task failed with lost RDDs after I reduced the RDD storage in k-means||, but the old method could still produce the result. When I enlarged the memory to 20G, k-means|| achieved the same performance as the old method.
    
    The difference between k-means|| and the old method is not significant if the user's data is small and a small executor memory is set. And if users set small executor memory on each node while having a large number of VMs or nodes, I think the networking cost will rise. It does not seem very practical.
     




[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

Posted by HuJiayin <gi...@git.apache.org>.
Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-141855212
  
    On the other hand, `newCenters` causes a sudden increase in memory usage; even though `clear` is called immediately, I think it still waits for GC to reclaim the memory. `newCenters` will also still merge in some duplicate centers. I heard some clusters at a large internet company here set 4-5G of yarn executor memory per node and have related usage. You could give them instructions if they meet a similar issue.
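
A small illustration of the clear-versus-GC point (assuming newCenters is a mutable buffer of center vectors, as in KMeans.scala; the sizes below are made up):

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Simulate one round's worth of newly sampled centers.
    val newCenters = ArrayBuffer.fill[Vector](50000)(
      Vectors.dense(Array.fill(20)(math.random)))

    newCenters.clear()  // references are dropped immediately...
    // ...but the vectors are only reclaimed on a later GC cycle, so the
    // allocation spike can outlive the clear() call.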




[GitHub] spark pull request: SPARK-10329

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-140283221
  
    @HuJiayin Could you add a description to the PR title? The JIRA number alone doesn't describe the content. For the implementation, maybe we can avoid duplicating code. How about not caching the cost RDD if the number of features is small? That should reduce the storage overhead, but it might be slower than before.
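
A minimal sketch of that suggestion (the helper name and the dimension cutoff are assumptions, not from the patch): persist the per-point cost RDD only when recomputing it would be expensive, i.e. when the feature dimension is large.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Sketch: skip caching when recomputation is cheap (few features),
    // trading some extra CPU for lower storage overhead.
    def maybeCacheCosts(costs: RDD[Double], numFeatures: Int): RDD[Double] = {
      val dimCutoff = 100  // hypothetical threshold
      if (numFeatures > dimCutoff) costs.persist(StorageLevel.MEMORY_AND_DISK)
      else costs
    }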

