Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/09 10:04:21 UTC

[GitHub] [spark] WangGuangxin opened a new pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

WangGuangxin opened a new pull request #35460:
URL: https://github.com/apache/spark/pull/35460

### What changes were proposed in this pull request?
When we shuffle on an indeterminate expression such as `rand` and a ShuffleFetchFailed occurs, we may get an incorrect result, because only the failed map tasks are retried.

To illustrate this, suppose we have a dataset with two columns, `(range(1, 5) as a, rand() as b)`, and we shuffle by `b` using two map tasks and two reduce tasks.
<img width="630" alt="image" src="https://user-images.githubusercontent.com/1312321/153171110-13887a78-565f-4cdf-afda-bb9053ea7ef0.png">

**When a fetch failure occurs and we need to rerun map task 2, the generated partitions may differ from those of the previous attempt, and we end up with a duplicate record with a = 4.**
<img width="610" alt="image" src="https://user-images.githubusercontent.com/1312321/153172262-bf5e2c81-8db6-4b01-8b70-8f284b54b09d.png">
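The failure mode above can be reproduced with a small simulation. This is only a sketch: `run_map_task` is a hypothetical helper, the hard-coded `draws` stand in for the values `rand()` would produce on each attempt, and the specific row-to-partition assignments are assumptions chosen to match the described outcome, not values taken from the diagrams.

```python
NUM_REDUCERS = 2


def run_map_task(rows, draws):
    """Partition rows by b, where each value in draws stands in for rand()."""
    out = {p: [] for p in range(NUM_REDUCERS)}
    for a, b in zip(rows, draws):
        out[int(b * NUM_REDUCERS)].append(a)
    return out


# First attempt of both map tasks (draws are assumed values).
map1 = run_map_task([1, 2], [0.1, 0.7])        # a=1 -> p0, a=2 -> p1
map2_first = run_map_task([3, 4], [0.8, 0.2])  # a=3 -> p1, a=4 -> p0

# Reducer 0 fetches partition 0 from both map tasks successfully.
reducer0 = map1[0] + map2_first[0]

# Reducer 1 hits a fetch failure on map task 2. Only map task 2 is rerun,
# and rand() now yields different values, so rows land in other partitions.
map2_retry = run_map_task([3, 4], [0.9, 0.6])  # a=3 -> p1, a=4 -> p1
reducer1 = map1[1] + map2_retry[1]

result = sorted(reducer0 + reducer1)
print(result)  # [1, 2, 3, 4, 4]
```

Reducer 0 already consumed `a = 4` from the first attempt, and reducer 1 reads it again from the retried attempt, so the row appears twice. With other draws a row could just as easily be lost instead of duplicated.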

This is very similar to the bug in Repartition+Shuffle, which was fixed by https://github.com/apache/spark/pull/22112.
This PR tries to fix it by reusing the existing mechanism.
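
As a rough illustration of why rerunning more than the failed task helps, the sketch below assumes a recovery strategy in which all earlier map output is discarded and every map task and reducer runs again. That strategy is an assumption made for the example, not a description of the code in this PR; `partition`, `reduce_all`, and the draw values are hypothetical.

```python
NUM_REDUCERS = 2


def partition(rows, draws):
    """Assign each row to a reducer from b, where draws stands in for rand()."""
    out = [[] for _ in range(NUM_REDUCERS)]
    for a, b in zip(rows, draws):
        out[int(b * NUM_REDUCERS)].append(a)
    return out


def reduce_all(map_outputs):
    """Every reducer reads its own partition from every map output."""
    return [sum((m[p] for m in map_outputs), []) for p in range(NUM_REDUCERS)]


# After the fetch failure, rerun both map tasks with fresh (assumed) draws
# and let both reducers read only from this single, consistent attempt.
rerun_outputs = [
    partition([1, 2], [0.4, 0.9]),  # a=1 -> p0, a=2 -> p1
    partition([3, 4], [0.9, 0.6]),  # a=3 -> p1, a=4 -> p1
]
reducers = reduce_all(rerun_outputs)
rows_seen = sorted(a for part in reducers for a in part)
print(rows_seen)  # [1, 2, 3, 4]
```

Because every reducer reads from the same attempt of every map task, each row is seen exactly once regardless of which partition the new random values send it to.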

### Why are the changes needed?
Fixes a data inconsistency issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit test coverage.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
