Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/27 22:46:42 UTC

[GitHub] [spark] venkata91 opened a new pull request #34122: [WIP][SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

venkata91 opened a new pull request #34122:
URL: https://github.com/apache/spark/pull/34122

### What changes were proposed in this pull request?

Currently, shuffle mergers are fetched before the `ShuffleMapStage` starts. For initial stages this can be problematic: shuffle mergers are simply the unique hosts running shuffle services, and early on there may be very few of them, depending on how many executors have started. This can result in a low merge ratio.
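To illustrate why few executors mean few mergers, here is a minimal Python sketch. It is not Spark code: the `(executor_id, host)` tuples, the host names, and the `candidate_mergers` helper are all hypothetical, and it only models the "unique hosts" idea described above.

```python
# Hypothetical model: each executor is an (executor_id, host) pair.
def candidate_mergers(executors):
    """Return the distinct hosts, in first-seen order, that could act as mergers."""
    seen = []
    for _executor_id, host in executors:
        if host not in seen:
            seen.append(host)
    return seen


# Early in the application only a few executors have registered.
early = [("1", "host-a"), ("2", "host-a"), ("3", "host-b")]
# Later, more executors have registered on additional hosts.
later = early + [("4", "host-c"), ("5", "host-d"), ("6", "host-c")]

print(len(candidate_mergers(early)))  # 2
print(len(candidate_mergers(later)))  # 4
```

In this toy model, a stage that fixes its mergers at the `early` point only ever sees two merger hosts, even though four become available shortly afterwards.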

With this approach, a `ShuffleMapTask` queries for merger locations if they are not yet available and, once they are available, starts using them to push blocks. Since partitions are mapped uniquely to a merger location, it should be fine for the earlier set of tasks not to push. This should improve the merge ratio even for initial stages.
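The flow described above can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the code in this PR: `Driver`, `ShuffleState`, `run_map_task`, and the modulo-based partition-to-merger mapping are all hypothetical names and choices made for illustration.

```python
class Driver:
    """Hypothetical stand-in for the driver-side component that knows the mergers."""

    def __init__(self):
        self.mergers = []  # empty until enough merger hosts are known

    def get_merger_locations(self, shuffle_id):
        return list(self.mergers)


class ShuffleState:
    """Hypothetical executor-side cache of merger locations for one shuffle."""

    def __init__(self, driver, shuffle_id):
        self.driver = driver
        self.shuffle_id = shuffle_id
        self.mergers = []

    def mergers_for_task(self):
        # Query the driver only while no merger locations are cached.
        if not self.mergers:
            self.mergers = self.driver.get_merger_locations(self.shuffle_id)
        return self.mergers


def merger_for_partition(mergers, reduce_partition):
    # Assumption: a simple modulo mapping, so each partition has exactly one merger.
    return mergers[reduce_partition % len(mergers)]


def run_map_task(state, reduce_partitions):
    """Return the push plan {partition: merger}; empty if the task does not push."""
    mergers = state.mergers_for_task()
    if not mergers:
        return {}  # mergers not available yet: this task skips pushing
    return {p: merger_for_partition(mergers, p) for p in reduce_partitions}


driver = Driver()
state = ShuffleState(driver, shuffle_id=0)

first_plan = run_map_task(state, range(4))   # early task: nothing to push to
driver.mergers = ["host-a", "host-b"]        # mergers become available
second_plan = run_map_task(state, range(4))  # later task: starts pushing
```

Here `first_plan` is empty and `second_plan` assigns every partition to one merger. Once a non-empty merger list is cached it is reused, so in this model later tasks keep mapping each partition to the same merger.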

Note: This is currently WIP because the changes are built on top of SPARK-33701. Once that is merged, I will remove those changes, update this PR, and remove the WIP tag.

### Why are the changes needed?

Performance improvement. No API changes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests. This change has also been running in our internal production environment for a while now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org