
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/10 22:27:40 UTC

[GitHub] [spark] sunchao opened a new pull request, #38196: [SPARK-40703][SQL] Introduce shuffle on SinglePartition to improve parallelism

sunchao opened a new pull request, #38196:
URL: https://github.com/apache/spark/pull/38196

### What changes were proposed in this pull request?

This PR fixes a performance regression that occurs when one side of a join uses `HashPartitioning` with `ShuffleExchange` while the other side uses `SinglePartition`. In this case, Spark re-shuffles the side with `HashPartitioning`, and both sides end up with only a single partition. This can hurt query performance significantly if the side with `HashPartitioning` contains a large amount of input data.

### Why are the changes needed?

After SPARK-35703, when Spark sees that one side of the join has `ShuffleExchange` (meaning it needs to be shuffled anyway) and the other side doesn't, it tries to avoid shuffling the side without `ShuffleExchange`. For instance:

```
ShuffleExchange(HashPartition(200)) <-> HashPartition(150)
```

will be converted into
```
ShuffleExchange(HashPartition(150)) <-> HashPartition(150)
```
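
As a rough illustration of the rule above, here is a minimal Python sketch. It is a toy model, not Spark's implementation: the `Side` class and the `align` function are hypothetical names introduced only for this example, and partitioning is reduced to a partition count plus a flag saying whether the side already has a shuffle.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Side:
    """Toy stand-in for one join child (hypothetical, not a Spark class)."""
    num_partitions: int
    has_shuffle: bool


def align(left: Side, right: Side) -> Tuple[Side, Side]:
    """If exactly one side already has a shuffle, re-target that shuffle
    to the other side's partition count so the other side stays untouched."""
    if left.has_shuffle and not right.has_shuffle:
        return Side(right.num_partitions, True), right
    if right.has_shuffle and not left.has_shuffle:
        return left, Side(left.num_partitions, True)
    # Both or neither side shuffled: this sketch leaves them as-is.
    return left, right


# ShuffleExchange(HashPartition(200)) <-> HashPartition(150)
new_left, new_right = align(Side(200, True), Side(150, False))
print(new_left, new_right)
```

In this model the left side's shuffle is re-targeted to 150 partitions while the right side is left unchanged, which mirrors the conversion shown above.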

However, when the side without `ShuffleExchange` is `SinglePartition`, like the following:
```
ShuffleExchange(HashPartition(150)) <-> SinglePartition
```

Spark will also do the same, which causes the left-hand side to use only one partition. This can hurt job parallelism dramatically, especially when using DataSource V2, since `SinglePartition` is used by the V2 scan. On the other hand, it seems DataSource V1 won't be impacted much, as it always reports `UnknownPartitioning` in this situation.
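
The difference can be sketched with a small Python toy model. This is not Spark code: `Side`, `align_old`, and `align_new` are hypothetical names, and the choice to reuse the shuffled side's partition count for the newly introduced shuffle is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Side:
    """Toy stand-in for one join child (hypothetical, not a Spark class)."""
    partitioning: str      # "hash" or "single"
    num_partitions: int
    has_shuffle: bool


def align_old(shuffled: Side, other: Side) -> Tuple[Side, Side]:
    """Behavior described above: the shuffled side always adopts the
    other side's partitioning, even when that is a single partition."""
    return Side(other.partitioning, other.num_partitions, True), other


def align_new(shuffled: Side, other: Side) -> Tuple[Side, Side]:
    """Assumed behavior after this PR: a SinglePartition side gets its own shuffle."""
    if other.partitioning == "single":
        # Assumption: the new shuffle reuses the shuffled side's partition count.
        return shuffled, Side("hash", shuffled.num_partitions, True)
    return align_old(shuffled, other)


left = Side("hash", 150, True)      # ShuffleExchange(HashPartition(150))
right = Side("single", 1, False)    # SinglePartition

print(align_old(left, right))   # left side collapses to one partition
print(align_new(left, right))   # both sides keep 150 partitions
```

In this model the old rule drops the left side to one partition, while the new rule pays for an extra shuffle on the small side and keeps the join parallel.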

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new unit tests in `EnsureRequirementsSuite`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
