Posted to commits@spark.apache.org by do...@apache.org on 2024/01/10 05:10:48 UTC

(spark) branch master updated: [SPARK-46648][SQL] Use `zstd` as the default ORC compression

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new a3991b17e379 [SPARK-46648][SQL] Use `zstd` as the default ORC compression
a3991b17e379 is described below

commit a3991b17e3790aded41dea1160b50ac605275c81
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Jan 9 21:10:38 2024 -0800

    [SPARK-46648][SQL] Use `zstd` as the default ORC compression
    
    ### What changes were proposed in this pull request?
    
    This PR aims to use `zstd` as the default ORC compression.
    
    Note that Apache ORC v2.0 also uses `zstd` as the default compression via [ORC-1577](https://issues.apache.org/jira/browse/ORC-1577).
    
    The following presentation covers the usage of ZStandard.
    - _The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro_
        - [Slides](https://www.slideshare.net/databricks/the-rise-of-zstandard-apache-sparkparquetorcavro)
        - [Youtube](https://youtu.be/dTGxhHwjONY)
    
    ### Why are the changes needed?
    
    In general, `ZStandard` produces smaller files than `snappy`. For example, the TPC-DS SF10 ORC data shrinks from 2.8 GiB to 2.4 GiB.
    ```
    $ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-snappy/ --recursive --summarize --human-readable | tail -n1
       Total Size: 2.8 GiB
    
    $ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-zstd/ --recursive --summarize --human-readable | tail -n1
       Total Size: 2.4 GiB
    ```
    
    As a result, performance is also generally better on cloud storage.
    
    ```
    $ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
    build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-snappy"
    ...
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q1
    [info]   Stopped after 2 iterations, 5712 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q1                                                 2708           2856         210          0.2        5869.3       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q2
    [info]   Stopped after 2 iterations, 7006 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q2                                                 3424           3503         113          0.7        1533.9       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q3
    [info]   Stopped after 2 iterations, 6577 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q3                                                 3146           3289         202          0.9        1059.0       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q4
    [info]   Stopped after 2 iterations, 36228 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q4                                                17592          18114         738          0.3        3375.5       1.0X
    ...
    ```
    
    ```
    $ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
    build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-zstd"
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q1
    [info]   Stopped after 2 iterations, 5235 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q1                                                 2496           2618         172          0.2        5409.7       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q2
    [info]   Stopped after 2 iterations, 6765 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q2                                                 3338           3383          63          0.7        1495.6       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q3
    [info]   Stopped after 2 iterations, 5882 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q3                                                 2820           2941         172          1.1         949.1       1.0X
    [info] Running benchmark: TPCDS Snappy
    [info]   Running case: q4
    [info]   Stopped after 2 iterations, 32925 ms
    [info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
    [info] Apple M1 Max
    [info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] q4                                                16315          16463         208          0.3        3130.5       1.0X
    ...
    ```
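    
    To make the comparison easier to read, the relative differences can be computed from the numbers reported above. This is only an illustrative sketch: the object and method names are hypothetical, and the inputs are the `Total Size` values and the `Best Time(ms)` column copied from the listings in this message.
    
    ```scala
    object ZstdVsSnappy {
      // Percentage by which `after` is smaller than `before`.
      def reductionPct(before: Double, after: Double): Double =
        (before - after) / before * 100.0

      // Total size in GiB from the two `aws s3 ls` listings (snappy, zstd).
      val totalSizeGiB: (Double, Double) = (2.8, 2.4)

      // Best Time(ms) per query from the two benchmark runs (query, snappy, zstd).
      val bestTimeMs: Seq[(String, Int, Int)] = Seq(
        ("q1", 2708, 2496),
        ("q2", 3424, 3338),
        ("q3", 3146, 2820),
        ("q4", 17592, 16315)
      )

      def main(args: Array[String]): Unit = {
        val (snappySize, zstdSize) = totalSizeGiB
        println(f"size: ${reductionPct(snappySize, zstdSize)}%.1f%% smaller with zstd")
        bestTimeMs.foreach { case (query, snappy, zstd) =>
          println(f"$query: ${reductionPct(snappy, zstd)}%.1f%% less time with zstd")
        }
      }
    }
    ```
    
    With these inputs, the size reduction is roughly 14%, and the best-time reduction is roughly 8%, 3%, 10% and 7% for q1 to q4. Each run stopped after 2 iterations, so the timing differences are indicative only.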
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the default ORC compression codec is changed from `snappy` to `zstd`.
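    
    Jobs that rely on Snappy-compressed ORC output can keep the previous behavior by setting `spark.sql.orc.compression.codec` explicitly, or by passing the `compression` write option, which overrides the session configuration. A minimal sketch, assuming a local Spark session; the object name, application name and `/tmp` output paths are hypothetical:
    
    ```scala
    import org.apache.spark.sql.SparkSession

    object OrcCodecExample {
      def main(args: Array[String]): Unit = {
        // Restore the previous default codec for every ORC write in this session.
        val spark = SparkSession.builder()
          .appName("orc-codec-example")
          .master("local[*]")
          .config("spark.sql.orc.compression.codec", "snappy")
          .getOrCreate()

        val df = spark.range(1000).toDF("id")

        // Written with snappy, taken from the session configuration above.
        df.write.mode("overwrite").orc("/tmp/orc-codec-example/snappy")

        // The per-write `compression` option takes precedence over the
        // session configuration, so this output is written with zstd.
        df.write.mode("overwrite")
          .option("compression", "zstd")
          .orc("/tmp/orc-codec-example/zstd")

        spark.stop()
      }
    }
    ```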
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #44654 from dongjoon-hyun/SPARK-46648.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 docs/sql-data-sources-orc.md                                            | 2 +-
 docs/sql-migration-guide.md                                             | 1 +
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md
index 561f601aa4e5..abd1901d24e4 100644
--- a/docs/sql-data-sources-orc.md
+++ b/docs/sql-data-sources-orc.md
@@ -240,7 +240,7 @@ Data source options of ORC can be set via:
   </tr>
   <tr>
     <td><code>compression</code></td>
-    <td><code>snappy</code></td>
+    <td><code>zstd</code></td>
     <td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, zstd and lz4). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>.</td>
     <td>write</td>
   </tr>
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 30a37d97042a..dbb25e5adc04 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -36,6 +36,7 @@ license: |
   - `spark.sql.parquet.int96RebaseModeInRead` instead of `spark.sql.legacy.parquet.int96RebaseModeInRead`
   - `spark.sql.avro.datetimeRebaseModeInWrite` instead of `spark.sql.legacy.avro.datetimeRebaseModeInWrite`
   - `spark.sql.avro.datetimeRebaseModeInRead` instead of `spark.sql.legacy.avro.datetimeRebaseModeInRead`
+- Since Spark 4.0, the default value of `spark.sql.orc.compression.codec` is changed from `snappy` to `zstd`. To restore the previous behavior, set `spark.sql.orc.compression.codec` to `snappy`.
 
 ## Upgrading from Spark SQL 3.4 to 3.5
 
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index d1ac061f02af..1928e74363cb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1211,7 +1211,7 @@ object SQLConf {
     .stringConf
     .transform(_.toLowerCase(Locale.ROOT))
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"))
-    .createWithDefault("snappy")
+    .createWithDefault("zstd")
 
   val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
     .doc("When native, use the native version of ORC support instead of the ORC library in Hive. " +

