You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/10/10 12:39:06 UTC

[PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

beliefer opened a new pull request, #43310:
URL: https://github.com/apache/spark/pull/43310

   ### What changes were proposed in this pull request?
   https://github.com/apache/spark/pull/41507 supported the new parquet compression codec `lz4raw`. But `lz4raw` is not a correct parquet compression codec name.
   
   We should use `lz4_raw` as its name.
   
   
   ### Why are the changes needed?
   Fix the bug that uses incorrect parquet compression codec `lz4raw`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'.
   Fix a bug.
   
   
   ### How was this patch tested?
   New test cases.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "wangyum (via GitHub)" <gi...@apache.org>.

wangyum commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354289794


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")

Review Comment:
   What is the suffix of the file after changing to `LZ4_RAW`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1765534524

   https://github.com/apache/spark/pull/43330 merged.
   @dongjoon-hyun @wangyum @LuciferYang Could you review this PR again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354142346


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")
     .version("1.1.1")
     .stringConf
     .transform(_.toLowerCase(Locale.ROOT))
     .checkValues(
-      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4raw", "zstd"))
+      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd"))

Review Comment:
   I know that. But 3.5.0 released latest, could we fix it ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1757165766

   > Yes, I guess we can deprecated it at Spark 3.5.1 with a warning and remove it at Apache Spark 4.0.0. Since it's a fairly new config value, there will be not many users.
   
   I will create a PR for branch-3.5 to deprecate `lz4raw` and add `lz4_raw`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354137694


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")
     .version("1.1.1")
     .stringConf
     .transform(_.toLowerCase(Locale.ROOT))
     .checkValues(
-      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4raw", "zstd"))
+      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd"))

Review Comment:
   Well, unfortunately, we cannot do like this because Apache Spark 3.5.0 is already released.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354138233


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala:
##########
@@ -96,7 +96,7 @@ object ParquetOptions extends DataSourceOptions {
     "lzo" -> CompressionCodecName.LZO,
     "brotli" -> CompressionCodecName.BROTLI,
     "lz4" -> CompressionCodecName.LZ4,
-    "lz4raw" -> CompressionCodecName.LZ4_RAW,

Review Comment:
   Ditto. We cannot delete like this. Only we can add like next line for backward-compatibility.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer closed pull request #43310: [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw
URL: https://github.com/apache/spark/pull/43310


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354410049


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")
     .version("1.1.1")
     .stringConf
     .transform(_.toLowerCase(Locale.ROOT))
     .checkValues(
-      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4raw", "zstd"))
+      Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd"))

Review Comment:
   No, if we remove this here, the production job with the existing configuration fails with Spark 3.5.1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354418155


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")

Review Comment:
   Since there is no change at file name layer, it's good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1352405312


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala:
##########
@@ -105,7 +114,7 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar
 
   test("Create parquet table with compression") {
     Seq(true, false).foreach { isPartitioned =>
-      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4")
+      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW")

Review Comment:
   Before this fix, if we use `LZ4RAW` here, the test case will be failed!
   
   ```
   realCompressionCodecs.forall(((x$1: String) => x$1.==(compressionCodec))) was false
   ScalaTestFailureLocation: org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite at (ParquetCompressionCodecPrecedenceSuite.scala:101)
   org.scalatest.exceptions.TestFailedException: realCompressionCodecs.forall(((x$1: String) => x$1.==(compressionCodec))) was false
   	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
   	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$2(ParquetCompressionCodecPrecedenceSuite.scala:101)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
   	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:306)
   	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:304)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.withTable(ParquetCompressionCodecPrecedenceSuite.scala:30)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$1(ParquetCompressionCodecPrecedenceSuite.scala:96)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$1$adapted(ParquetCompressionCodecPrecedenceSuite.scala:94)
   	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:79)
   	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:78)
   	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:245)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(ParquetCompressionCodecPrecedenceSuite.scala:30)
   	at org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:78)
   	at org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:77)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.withTempDir(ParquetCompressionCodecPrecedenceSuite.scala:30)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.checkCompressionCodec(ParquetCompressionCodecPrecedenceSuite.scala:94)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$10(ParquetCompressionCodecPrecedenceSuite.scala:110)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$10$adapted(ParquetCompressionCodecPrecedenceSuite.scala:109)
   	at scala.collection.immutable.List.foreach(List.scala:333)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$9(ParquetCompressionCodecPrecedenceSuite.scala:109)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$9$adapted(ParquetCompressionCodecPrecedenceSuite.scala:107)
   	at scala.collection.immutable.List.foreach(List.scala:333)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$8(ParquetCompressionCodecPrecedenceSuite.scala:107)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354139275


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala:
##########
@@ -105,7 +114,7 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar
 
   test("Create parquet table with compression") {
     Seq(true, false).foreach { isPartitioned =>
-      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4")
+      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW")

Review Comment:
   If this is only a test utility method, `checkCompressionCodec`, failure? Or, is there any user-facing issue?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354141327


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala:
##########
@@ -59,7 +59,7 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
   // on Maven Central.
   override protected def availableCodecs: Seq[String] = {
-    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")
+    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4_raw")

Review Comment:
   This test case is success. But https://github.com/apache/spark/pull/43310/files#r1352405312 failed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1771944613

   Merged. Thanks @dongjoon-hyun @srowen @HyukjinKwon @wangyum @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1757018381

   @dongjoon-hyun @wangyum Because Spark 3.5.0 released. The name `lz4raw` already exposed. So we reserve the `lz4raw` for backward-compatibility and add the correct `lz4_raw`. We can deprecate `lz4raw` until some Spark version and remove it in future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1769743198

   cc @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1757139681

   Yes, I guess we can deprecated it at Spark 3.5.1 with a warning and remove it at Apache Spark 4.0.0. Since it's a fairly new config value, there will be not many users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1755555683

   cc @wangyum FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1756849868

   cc @dongjoon-hyun @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354304018


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1014,12 +1014,12 @@ object SQLConf {
       "`parquet.compression` is specified in the table-specific options/properties, the " +
       "precedence would be `compression`, `parquet.compression`, " +
       "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")

Review Comment:
   Before this PR:
   `part-00000-fc07d464-03b2-42d6-adc1-68a3adca1752.c000.lz4raw.parquet`
   After this PR:
   `part-00000-07244014-f31a-4097-8878-dd3630e721ce.c000.lz4raw.parquet`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354250813


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala:
##########
@@ -96,7 +96,7 @@ object ParquetOptions extends DataSourceOptions {
     "lzo" -> CompressionCodecName.LZO,
     "brotli" -> CompressionCodecName.BROTLI,
     "lz4" -> CompressionCodecName.LZ4,
-    "lz4raw" -> CompressionCodecName.LZ4_RAW,

Review Comment:
   This is a good idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354296974


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala:
##########
@@ -105,7 +114,7 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar
 
   test("Create parquet table with compression") {
     Seq(true, false).foreach { isPartitioned =>
-      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4")
+      val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW")

Review Comment:
   Yes. `checkCompressionCodec` failed! It's a test utility method and unrelated to user-facing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #43310:
URL: https://github.com/apache/spark/pull/43310#discussion_r1354137166


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala:
##########
@@ -59,7 +59,7 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
   // on Maven Central.
   override protected def availableCodecs: Seq[String] = {
-    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")
+    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4_raw")

Review Comment:
   In this case, it succeeds currently. Do you know the difference, @beliefer ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43310:
URL: https://github.com/apache/spark/pull/43310#issuecomment-1767903044

   @dongjoon-hyun Could you have time to take a review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org