Posted to reviews@spark.apache.org by megaserg <gi...@git.apache.org> on 2017/08/18 05:58:01 UTC

[GitHub] spark pull request #18990: [SPARK-21782][Core] Repartition creates skews whe...

GitHub user megaserg opened a pull request:

    https://github.com/apache/spark/pull/18990

    [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

    ## Problem
    When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to a numPartitions that is a power of 2, the resulting partitions are very unevenly sized. The cause is that the PRNG is initialized with a fixed seed and then used only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
    
    ## What changes were proposed in this pull request?
    Instead of using a fixed seed, use the default constructor for `Random`.
    
    ## How was this patch tested?
    `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`
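
The skew described under "Problem" can be reproduced outside Spark. The sketch below assumes, based on the description above and the linked JIRA, that each source partition picks its starting target position as `new Random(partitionIndex).nextInt(numPartitions)`. It is an illustration, not the actual Spark code, and the class and method names are hypothetical. For a power-of-2 bound, `java.util.Random.nextInt` returns the high bits of a single generator step, so consecutive small seeds land on neighboring values and only a small fraction of the target partitions are ever chosen as a starting position.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Spark.
class RepartitionSkewSketch {

    // Counts the distinct starting positions produced when each source
    // partition index is used directly as the PRNG seed and the PRNG is
    // sampled exactly once (the assumed pre-patch pattern).
    static int distinctStartPositions(int numSourcePartitions, int numPartitions) {
        Set<Integer> positions = new HashSet<>();
        for (int index = 0; index < numSourcePartitions; index++) {
            positions.add(new Random(index).nextInt(numPartitions));
        }
        return positions.size();
    }

    public static void main(String[] args) {
        int n = 1024; // a power of 2, as in the report
        System.out.println("distinct start positions for " + n
                + " source partitions into " + n + " target partitions: "
                + distinctStartPositions(n, n));
    }
}
```

With a low item-per-partition ratio, each source partition writes only a few items beyond its starting position, so clustered starting positions translate directly into unevenly sized output partitions.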


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/megaserg/spark repartition-skew

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18990.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18990
    
----
commit 2cb7550b8ecada3c504621a75c4f82d13880496b
Author: Sergey Serebryakov <ss...@tesla.com>
Date:   2017-08-18T05:47:55Z

    [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18990
  
    LGTM. I agree that in theory there is no reason we should depend on the exact shuffle distribution here. It should be beneficial to have a more even distribution.




[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18990
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

Posted by megaserg <gi...@git.apache.org>.
Github user megaserg commented on the issue:

    https://github.com/apache/spark/pull/18990
  
    Sorry, I edited the pull request body. @srowen's comment above referred to the initial version, in which I proposed using the default, non-deterministic constructor `Random()`.




[GitHub] spark pull request #18990: [SPARK-21782][Core] Repartition creates skews whe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18990




[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18990
  
    **[Test build #3891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3891/testReport)** for PR 18990 at commit [`bee7fca`](https://github.com/apache/spark/commit/bee7fcaf0a3601ee933bf739f32b14d4abcdee30).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18990
  
    **[Test build #3891 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3891/testReport)** for PR 18990 at commit [`bee7fca`](https://github.com/apache/spark/commit/bee7fcaf0a3601ee933bf739f32b14d4abcdee30).

