Posted to reviews@spark.apache.org by "jwang0306 (via GitHub)" <gi...@apache.org> on 2024/03/01 02:07:51 UTC

[PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]

jwang0306 opened a new pull request, #45348:
URL: https://github.com/apache/spark/pull/45348

### What changes were proposed in this pull request?

This PR proposes an improvement that reduces executor memory usage by broadcasting the code generated during whole-stage codegen (WSCG).

In certain internal workloads, we have observed generated code as large as hundreds of MBs, which can be the last straw that pushes an executor into OOM. To improve stability in such pathological cases, we make `cleanedSource` (the generated code) a broadcast variable. In theory, this reduces the memory held per executor from (`cleanedSource` * num tasks) to (`cleanedSource` * 1), at the cost of additional network latency.
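
As a rough illustration of the arithmetic above, the following sketch compares the two cases. The 100 MiB source size and the 16 concurrent tasks are hypothetical numbers chosen for the example, and the object and method names are not part of this PR; real memory usage also depends on JVM overhead and on how many tasks actually run at once.

```scala
// Back-of-the-envelope estimate only, under the assumptions stated above.
object CodegenMemoryEstimate {
  // Without broadcast: each concurrently running task holds its own copy.
  def perTaskCopies(sourceBytes: Long, concurrentTasks: Int): Long =
    sourceBytes * concurrentTasks

  // With broadcast: one copy is shared by all tasks on the executor.
  def broadcastCopy(sourceBytes: Long): Long = sourceBytes

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024
    val sourceBytes = 100L * mib // hypothetical: 100 MiB of generated code
    val concurrentTasks = 16     // hypothetical: 16 task slots per executor
    println(s"without broadcast: ${perTaskCopies(sourceBytes, concurrentTasks) / mib} MiB")
    println(s"with broadcast:    ${broadcastCopy(sourceBytes) / mib} MiB")
  }
}
```

With these example numbers, the per-task copies add up to 1600 MiB, while a single shared copy stays at 100 MiB.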

The feature is gated by the Spark conf `spark.sql.codegen.broadcastCleanedSourceThreshold`, which sets the threshold that determines whether the generated code is broadcast during WSCG. A value less than 0 disables the feature; this is the default.

### Why are the changes needed?

Stability improvement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

#### Unit Test
Added a new unit test in `WholeStageCodegenSuite` covering three cases:
1. Threshold is -1: should not broadcast, since a value less than 0 disables the feature.
2. Threshold is a large number: should not broadcast, since the generated code size does not exceed it.
3. Threshold is 0: should broadcast, since 0 is always smaller than the generated code size.

```bash
~/spark$ build/sbt -java-home /usr/lib/jvm/java-17-openjdk-amd64 "sql/testOnly *WholeStageCodegenSuite -- -z SPARK-47238"
```
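
The three cases can be summarized by a small predicate. This is a minimal sketch of the behavior described in this PR, not the actual implementation: `BroadcastDecision`, `shouldBroadcast`, and the example code size are hypothetical, and the unit in which the code size is measured is an assumption left open here.

```scala
object BroadcastDecision {
  // Hypothetical predicate, not the PR's code: a negative threshold disables
  // the feature; otherwise broadcast once the code size exceeds the threshold.
  def shouldBroadcast(codeSize: Long, threshold: Long): Boolean =
    threshold >= 0 && codeSize > threshold

  def main(args: Array[String]): Unit = {
    val codeSize = 5000L // hypothetical generated-code size
    println(shouldBroadcast(codeSize, -1L))           // case 1: feature disabled
    println(shouldBroadcast(codeSize, Long.MaxValue)) // case 2: threshold not exceeded
    println(shouldBroadcast(codeSize, 0L))            // case 3: threshold exceeded
  }
}
```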

#### Manual Test
For ease of demonstration, we created a query that generates a large volume of code comments, which leads to a large generated code size, and configured the cluster as follows:
- set `spark.sql.codegen.comments=true` to enable verbose comments.
- set `spark.sql.adaptive.enabled=false` to ensure whole stage codegen.
- set `spark.driver.memory=10G` to give driver enough memory.
- set a larger number of cores (16) and a smaller total memory (1G) per executor.

##### Build the code
```bash
~/spark$ build/sbt -mem 10000 -java-home /usr/lib/jvm/java-17-openjdk-amd64 package
```

##### Without Broadcast
1. Create a local cluster with the configs mentioned above.
```bash
~/spark$ ./bin/spark-shell --master local-cluster[2,16,1024] --conf spark.sql.codegen.comments=true --conf spark.sql.adaptive.enabled=false --conf spark.driver.memory=10G
```
2. Run the artificial workload; the executor hits OOM.
```bash
scala> spark.sql(s"select ${(1 to 99).map(i => s"id as ${"name" * 300000}$i").mkString(", ")} from range(100)").collect()
```
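
To give a sense of scale, the sketch below counts the characters contributed by the column aliases in this query. It only measures the query text; how much of it ends up in the generated code through comments is an assumption not verified here, and `QueryTextSize` and its methods are hypothetical helpers.

```scala
object QueryTextSize {
  // Same alias shape as in the workload: "name" repeated `repeat` times,
  // followed by the column index.
  def aliasLength(i: Int, repeat: Int): Long =
    "name".length.toLong * repeat + i.toString.length

  // Total characters across all aliases for columns 1..columns.
  def totalAliasChars(columns: Int, repeat: Int): Long =
    (1 to columns).map(i => aliasLength(i, repeat)).sum

  def main(args: Array[String]): Unit = {
    println(totalAliasChars(99, 300000))
  }
}
```

For 99 columns and 300000 repetitions, the aliases alone come to 118,800,189 characters, roughly 119 million.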

##### With Broadcast
1. Create a local cluster with the configs mentioned above, and additionally set `spark.sql.codegen.broadcastCleanedSourceThreshold=0` so that the generated code is always broadcast.
```bash
~/spark$ ./bin/spark-shell --master local-cluster[2,16,1024] --conf spark.sql.codegen.comments=true --conf spark.sql.adaptive.enabled=false --conf spark.driver.memory=10G --conf spark.sql.codegen.broadcastCleanedSourceThreshold=0
```
2. Run the same artificial workload; the query finishes successfully.
```bash
scala> spark.sql(s"select ${(1 to 99).map(i => s"id as ${"name" * 300000}$i").mkString(", ")} from range(100)").collect()
```

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
