
Posted to reviews@spark.apache.org by jiangxb1987 <gi...@git.apache.org> on 2018/01/27 00:42:31 UTC

[GitHub] spark pull request #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD ...

GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/20414

    [SPARK-23243][SQL] Shuffle+Repartition on an RDD could lead to incorrect answers

    ## What changes were proposed in this pull request?
    
    `RDD.repartition()` also distributes data round-robin, which can cause incorrect answers on RDD workloads in the same way as described in #20393.
    
    However, the approach that fixes `DataFrame.repartition()` doesn't apply to RDD repartition, because the input data may not be comparable, as discussed in https://github.com/apache/spark/pull/20393#issuecomment-360912451
    
    Here, I propose a quick fix that distributes elements by their hashes. This will cause a performance regression if the input data is highly skewed, but it ensures result correctness.
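
The difference between the two strategies can be illustrated with a small standalone simulation. This is only a sketch in plain Python, not the patch itself: the function names are hypothetical, `crc32` stands in for an element's hash code, and the "retry" is modelled simply as the same records arriving in a different order.

```python
from zlib import crc32


def round_robin_partition(records, num_partitions, start=0):
    """Assign each record to a partition by its position in the input."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[(start + i) % num_partitions].append(rec)
    return parts


def hash_partition(records, num_partitions):
    """Assign each record to a partition by a hash of its value."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        # crc32 is an assumed stand-in for the element's hash code;
        # it depends only on the value, not on arrival order.
        parts[crc32(rec.encode("utf-8")) % num_partitions].append(rec)
    return parts


def membership(parts):
    """Which records landed in which partition, ignoring order inside it."""
    return [frozenset(p) for p in parts]


first_attempt = ["a", "b", "c", "d", "e", "f"]
# Same records, but a retried task happens to see them in a different order.
retry_attempt = ["b", "a", "c", "d", "f", "e"]

rr_first = membership(round_robin_partition(first_attempt, 3))
rr_retry = membership(round_robin_partition(retry_attempt, 3))
hash_first = membership(hash_partition(first_attempt, 3))
hash_retry = membership(hash_partition(retry_attempt, 3))

print("round-robin stable across retry:", rr_first == rr_retry)   # False
print("hash-based stable across retry: ", hash_first == hash_retry)  # True
```

With round-robin, a record's target partition changes whenever its position changes, so a partial retry can duplicate or drop records downstream. With hashing, placement depends only on the value, at the cost of uneven partitions when many records share a hash.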
    
    ## How was this patch tested?
    
    Added a test case in `RDDSuite` to ensure `RDD.repartition()` generates consistent answers.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark rdd-repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20414
    
----
commit 6910ed62c272bedfa251cab589bb52bed36be3ed
Author: Xingbo Jiang <xi...@...>
Date:   2018-01-27T00:34:24Z

    fix RDD.repartition()

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    @jiangxb1987 You are correct when the sizes of the maps are the same.
    But if the map sizes are different, the resulting order can be different - which can happen when requests for additional memory follow different patterns on re-execution (triggering a spill).


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    Thanks @mridulm, all great points!


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    Hey, I looked into `ExternalAppendOnlyMap`, and here are the findings:
    `ExternalAppendOnlyMap` claims to keep its content sorted, but it actually uses a `HashComparator` that compares elements by their hashes. Luckily, it sorts the elements using TimSort, which is stable. That means that even if there are hash collisions, the output sequence should still be deterministic as long as the inputs are (which we can achieve by modifying `ShuffleBlockFetcherIterator`, per the previous discussion).
    
    We may still need to check all the other places where we spill or compare objects, though, to ensure we generate a deterministic output sequence everywhere.
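
The stability argument can be checked with a small standalone sketch. It is written in plain Python rather than against `ExternalAppendOnlyMap`: `bucket_hash` is a hypothetical hash with deliberate collisions, and Python's built-in `sorted()` is used only because it is also a stable sort.

```python
def bucket_hash(key):
    """Hypothetical hash with deliberate collisions: equal-length keys collide."""
    return len(key)


def sort_by_hash(records):
    # sorted() is stable, so records with equal hashes keep their input order.
    return sorted(records, key=bucket_hash)


run_1 = ["ccc", "ab", "xy", "q", "cd"]
run_2 = ["ccc", "ab", "xy", "q", "cd"]  # same input order as run_1
run_3 = ["ccc", "xy", "ab", "q", "cd"]  # "ab" and "xy" arrive swapped

out_1 = sort_by_hash(run_1)
out_2 = sort_by_hash(run_2)
out_3 = sort_by_hash(run_3)

print(out_1)           # ['q', 'ab', 'xy', 'cd', 'ccc']
print(out_3)           # ['q', 'xy', 'ab', 'cd', 'ccc']
print(out_1 == out_2)  # True: same input order gives the same output
print(out_1 == out_3)  # False: colliding keys follow their arrival order
```

A stable sort by hash is therefore only as deterministic as its input: colliding keys come out in whatever order they went in, so the ordering of the fetched input still has to be fixed for the output to be repeatable.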


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    Ouch... Yeah, we have to figure out a way to make it deterministic under hash collisions.


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    @shivaram Thinking about it more, this might affect everything that does a zip (or variants and similar idioms such as limit K) on a partition - with random + index in coalesce + shuffle=true being one special case.
    
    Essentially, this affects anything that assumes the order of records in a partition will always be the same. Currently,
    * Reading from an external immutable source like HDFS (including checkpoints)
    * Reading from the block store
    * Sorted partitions
    should guarantee this - others need not.
    
    The more I think about it, the more I like @sameeragarwal's suggestion in #20393: a general solution for this could be to introduce deterministic output for shuffle fetch which, when enabled, uses a more expensive but repeatable iteration over the fetched shuffle blocks.
    
    This assumes that Spark shuffle is always repeatable given the same input (I have yet to look into this in detail when spills are involved - any thoughts, @sameeragarwal?), which could be an implementation detail; but we could make it a requirement for shuffle.
    
    Note that we might be able to avoid this additional cost for most of the current use cases (otherwise we would have faced this problem two major releases ago!), so the actual user impact, hopefully, might not be as high.


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    **[Test build #93558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93558/testReport)** for PR 20414 at commit [`6910ed6`](https://github.com/apache/spark/commit/6910ed62c272bedfa251cab589bb52bed36be3ed).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    In addition, any use of random in Spark code will be affected by this unless the input is an idempotent source, even if the random initialization is done predictably with the partition index (which we were doing here anyway).
    We might want to look at MLlib and other places as well.
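
A standalone sketch of why a predictable seed is not enough, in plain Python rather than Spark code: `sample_partition` is a hypothetical helper that seeds a generator from the partition index, and the retry is modelled as the same records arriving rotated by one position.

```python
import random


def sample_partition(records, partition_index, fraction=0.5):
    """Keep each record with probability `fraction`, seeded by partition index."""
    rng = random.Random(partition_index)
    # One draw per record: the decision depends on the record's position only.
    decisions = [rng.random() < fraction for _ in records]
    kept = [rec for rec, keep in zip(records, decisions) if keep]
    return decisions, kept


records = list(range(1000))
# Same records; the retried task sees them rotated by one position.
retry = records[1:] + records[:1]

decisions_a, kept_a = sample_partition(records, partition_index=42)
decisions_b, kept_b = sample_partition(retry, partition_index=42)

print("same decision at each position:", decisions_a == decisions_b)
print("same records kept:", set(kept_a) == set(kept_b))
```

The seeded generator reproduces the same keep/drop decision for each position, but the decisions attach to whichever record occupies that position, so the two runs keep different records unless the input order is itself repeatable.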


---


[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    **[Test build #86728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit [`6910ed6`](https://github.com/apache/spark/commit/6910ed62c272bedfa251cab589bb52bed36be3ed).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
