Posted to commits@spark.apache.org by do...@apache.org on 2024/01/10 05:10:48 UTC
(spark) branch master updated: [SPARK-46648][SQL] Use `zstd` as the default ORC compression
This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new a3991b17e379 [SPARK-46648][SQL] Use `zstd` as the default ORC compression
a3991b17e379 is described below
commit a3991b17e3790aded41dea1160b50ac605275c81
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Jan 9 21:10:38 2024 -0800
[SPARK-46648][SQL] Use `zstd` as the default ORC compression
### What changes were proposed in this pull request?
This PR aims to use `zstd` as the default ORC compression.
Note that Apache ORC v2.0 also uses `zstd` as the default compression via [ORC-1577](https://issues.apache.org/jira/browse/ORC-1577).
The following presentation covers the usage of ZStandard:
- _The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro_
- [Slides](https://www.slideshare.net/databricks/the-rise-of-zstandard-apache-sparkparquetorcavro)
- [Youtube](https://youtu.be/dTGxhHwjONY)
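Users who prefer the previous codec can still pin it explicitly, either session-wide or per write (a minimal sketch; the paths and trailing job arguments are placeholders):
```
# Session-wide, via configuration:
spark-submit --conf spark.sql.orc.compression.codec=snappy ...

# Or per write, via the ORC data source option:
df.write.option("compression", "snappy").orc("/tmp/out")
```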
### Why are the changes needed?
In general, `ZStandard` produces smaller files:
```
$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-snappy/ --recursive --summarize --human-readable | tail -n1
Total Size: 2.8 GiB
$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-zstd/ --recursive --summarize --human-readable | tail -n1
Total Size: 2.4 GiB
```
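A quick back-of-the-envelope check of the two totals above (only the 2.8 GiB and 2.4 GiB figures are taken from the listing):

```python
# Relative size reduction from the TPC-DS sf-10 totals quoted above.
snappy_gib = 2.8  # tpcds-sf-10-orc-snappy total size
zstd_gib = 2.4    # tpcds-sf-10-orc-zstd total size
saving = (snappy_gib - zstd_gib) / snappy_gib
print(f"zstd files are about {saving:.1%} smaller")
```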
As a result, query performance on cloud storage is also generally better.
```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-snappy"
...
[info] Running benchmark: TPCDS Snappy
[info] Running case: q1
[info] Stopped after 2 iterations, 5712 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1 2708 2856 210 0.2 5869.3 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q2
[info] Stopped after 2 iterations, 7006 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2 3424 3503 113 0.7 1533.9 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q3
[info] Stopped after 2 iterations, 6577 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3 3146 3289 202 0.9 1059.0 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q4
[info] Stopped after 2 iterations, 36228 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4 17592 18114 738 0.3 3375.5 1.0X
...
```
```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-zstd"
[info] Running benchmark: TPCDS Snappy
[info] Running case: q1
[info] Stopped after 2 iterations, 5235 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1 2496 2618 172 0.2 5409.7 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q2
[info] Stopped after 2 iterations, 6765 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2 3338 3383 63 0.7 1495.6 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q3
[info] Stopped after 2 iterations, 5882 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3 2820 2941 172 1.1 949.1 1.0X
[info] Running benchmark: TPCDS Snappy
[info] Running case: q4
[info] Stopped after 2 iterations, 32925 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4 16315 16463 208 0.3 3130.5 1.0X
...
```
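Comparing the best times from the two runs above, the per-query speedup of zstd over snappy works out as follows (a sketch using only the figures quoted in the logs):

```python
# Best times (ms) from the snappy and zstd benchmark runs above.
best_ms = {
    "q1": (2708, 2496),
    "q2": (3424, 3338),
    "q3": (3146, 2820),
    "q4": (17592, 16315),
}
for q, (snappy, zstd) in best_ms.items():
    # Ratio > 1.0 means the zstd run finished faster.
    print(f"{q}: {snappy / zstd:.2f}x")
```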
### Does this PR introduce _any_ user-facing change?
Yes, the default ORC compression codec is changed from `snappy` to `zstd`.
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #44654 from dongjoon-hyun/SPARK-46648.
Authored-by: Dongjoon Hyun <do...@apache.org>
Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
docs/sql-data-sources-orc.md | 2 +-
docs/sql-migration-guide.md | 1 +
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md
index 561f601aa4e5..abd1901d24e4 100644
--- a/docs/sql-data-sources-orc.md
+++ b/docs/sql-data-sources-orc.md
@@ -240,7 +240,7 @@ Data source options of ORC can be set via:
</tr>
<tr>
<td><code>compression</code></td>
- <td><code>snappy</code></td>
+ <td><code>zstd</code></td>
<td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, zstd and lz4). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>.</td>
<td>write</td>
</tr>
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 30a37d97042a..dbb25e5adc04 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -36,6 +36,7 @@ license: |
- `spark.sql.parquet.int96RebaseModeInRead` instead of `spark.sql.legacy.parquet.int96RebaseModeInRead`
- `spark.sql.avro.datetimeRebaseModeInWrite` instead of `spark.sql.legacy.avro.datetimeRebaseModeInWrite`
- `spark.sql.avro.datetimeRebaseModeInRead` instead of `spark.sql.legacy.avro.datetimeRebaseModeInRead`
+- Since Spark 4.0, the default value of `spark.sql.orc.compression.codec` is changed from `snappy` to `zstd`. To restore the previous behavior, set `spark.sql.orc.compression.codec` to `snappy`.
## Upgrading from Spark SQL 3.4 to 3.5
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index d1ac061f02af..1928e74363cb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1211,7 +1211,7 @@ object SQLConf {
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"))
- .createWithDefault("snappy")
+ .createWithDefault("zstd")
val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
.doc("When native, use the native version of ORC support instead of the ORC library in Hive. " +