You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/10/10 09:23:11 UTC

[PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

beliefer opened a new pull request, #43308:
URL: https://github.com/apache/spark/pull/43308

   ### What changes were proposed in this pull request?
   Currently, Spark supported most of all parquet compression codecs, the parquet supported compression codecs and spark supported are not completely one-on-one.
   There are a lot of magic strings copy from parquet compression codecs. This issue lead to developers need to manually maintain its consistency. It is easy to make mistakes and reduce development efficiency.
   
   
   ### Why are the changes needed?
   Let developers easy to use parquet compression codecs.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   Introduce a new class.
   
   
   ### How was this patch tested?
   Exists test cases.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1782205599

   The GA failure is unrelated.
   Merged to master
   @srowen @LuciferYang Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372833363


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:
##########
@@ -857,12 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
 
     // Checks default compression codec
     checkCompressionCodec(
-      CompressionCodecName.fromConf(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
+      ParquetCompressionCodec.fromString(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
 
-    checkCompressionCodec(CompressionCodecName.UNCOMPRESSED)
-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)

Review Comment:
   ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer closed pull request #43308: [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs
URL: https://github.com/apache/spark/pull/43308


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372808047


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:
##########
@@ -857,12 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
 
     // Checks default compression codec
     checkCompressionCodec(
-      CompressionCodecName.fromConf(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
+      ParquetCompressionCodec.fromString(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
 
-    checkCompressionCodec(CompressionCodecName.UNCOMPRESSED)
-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)

Review Comment:
   Unrelated to this pr, but why were only four types of Compression Codec tested here? Was the test case not modified when a new type was added?



##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##########
@@ -2709,31 +2709,32 @@ class HiveDDLSuite
     assert(compression === actualCompression)
   }
 
-  Seq(("orc", "ZLIB"), ("parquet", "GZIP")).foreach { case (fileFormat, compression) =>
-    test(s"SPARK-22158 convertMetastore should not ignore table property - $fileFormat") {
-      withSQLConf(CONVERT_METASTORE_ORC.key -> "true", CONVERT_METASTORE_PARQUET.key -> "true") {
-        withTable("t") {
-          withTempPath { path =>
-            sql(
-              s"""
-                |CREATE TABLE t(id int) USING hive
-                |OPTIONS(fileFormat '$fileFormat', compression '$compression')
-                |LOCATION '${path.toURI}'
+  Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {

Review Comment:
   nit: Make `Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name))` a variable, and then use `seq..foreach { case (fileFormat, compression) =>`. Would the code below need to be reformatted?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372835505


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##########
@@ -2709,31 +2709,32 @@ class HiveDDLSuite
     assert(compression === actualCompression)
   }
 
-  Seq(("orc", "ZLIB"), ("parquet", "GZIP")).foreach { case (fileFormat, compression) =>
-    test(s"SPARK-22158 convertMetastore should not ignore table property - $fileFormat") {
-      withSQLConf(CONVERT_METASTORE_ORC.key -> "true", CONVERT_METASTORE_PARQUET.key -> "true") {
-        withTable("t") {
-          withTempPath { path =>
-            sql(
-              s"""
-                |CREATE TABLE t(id int) USING hive
-                |OPTIONS(fileFormat '$fileFormat', compression '$compression')
-                |LOCATION '${path.toURI}'
+  Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {

Review Comment:
   For reduce the change here, let's use
   ```
   Seq(
     ("orc", "ZLIB"),
     ("parquet", ParquetCompressionCodec.GZIP.name)).foreach { case (fileFormat, compression) =>
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1367801622


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),

Review Comment:
   Why make our own enum if there is already an enum-like list of codecs in parquet?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1777081758

   cc @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372918425


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:
##########
@@ -857,12 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
 
     // Checks default compression codec
     checkCompressionCodec(
-      CompressionCodecName.fromConf(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
+      ParquetCompressionCodec.fromString(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
 
-    checkCompressionCodec(CompressionCodecName.UNCOMPRESSED)
-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)

Review Comment:
   I got it now. `lzo` is supported by cloudera Hadoop. Spark doesn't have it built-in.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1352297524


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
+  SNAPPY(CompressionCodecName.SNAPPY),
+  GZIP(CompressionCodecName.GZIP),
+  LZO(CompressionCodecName.LZO),
+  BROTLI(CompressionCodecName.BROTLI),
+  LZ4(CompressionCodecName.LZ4),
+  ZSTD(CompressionCodecName.ZSTD);
+
+  // There are some parquet supported compression codec that doesn't supported by Spark.
+  // LZ4_RAW(CompressionCodecName.LZ4_RAW)

Review Comment:
   https://github.com/apache/spark/pull/41507
   
   IIRC, LZ4_RAW is already supported?



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
+  SNAPPY(CompressionCodecName.SNAPPY),
+  GZIP(CompressionCodecName.GZIP),
+  LZO(CompressionCodecName.LZO),
+  BROTLI(CompressionCodecName.BROTLI),
+  LZ4(CompressionCodecName.LZ4),
+  ZSTD(CompressionCodecName.ZSTD);
+
+  // There are some parquet supported compression codec that doesn't supported by Spark.
+  // LZ4_RAW(CompressionCodecName.LZ4_RAW)

Review Comment:
   https://github.com/apache/spark/pull/41507
   
   IIRC, LZ4_RAW is already supported?



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
+  SNAPPY(CompressionCodecName.SNAPPY),
+  GZIP(CompressionCodecName.GZIP),
+  LZO(CompressionCodecName.LZO),
+  BROTLI(CompressionCodecName.BROTLI),
+  LZ4(CompressionCodecName.LZ4),
+  ZSTD(CompressionCodecName.ZSTD);
+
+  // There are some parquet supported compression codec that doesn't supported by Spark.
+  // LZ4_RAW(CompressionCodecName.LZ4_RAW)

Review Comment:
   https://github.com/apache/spark/pull/41507
   
   IIRC, `LZ4_RAW` is already supported?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1778869474

   cc @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1772493894

   ping @dongjoon-hyun @srowen @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1369519951


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),

Review Comment:
   Ok, looks fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1352305283


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),
+  SNAPPY(CompressionCodecName.SNAPPY),
+  GZIP(CompressionCodecName.GZIP),
+  LZO(CompressionCodecName.LZO),
+  BROTLI(CompressionCodecName.BROTLI),
+  LZ4(CompressionCodecName.LZ4),
+  ZSTD(CompressionCodecName.ZSTD);
+
+  // There are some parquet supported compression codec that doesn't supported by Spark.
+  // LZ4_RAW(CompressionCodecName.LZ4_RAW)

Review Comment:
   Thank you for the reminder. I will check it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372837027


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:
##########
@@ -857,12 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
 
     // Checks default compression codec
     checkCompressionCodec(
-      CompressionCodecName.fromConf(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
+      ParquetCompressionCodec.fromString(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
 
-    checkCompressionCodec(CompressionCodecName.UNCOMPRESSED)
-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)

Review Comment:
   If I have time, I will try to cover these tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372819401


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##########
@@ -2709,31 +2709,32 @@ class HiveDDLSuite
     assert(compression === actualCompression)
   }
 
-  Seq(("orc", "ZLIB"), ("parquet", "GZIP")).foreach { case (fileFormat, compression) =>
-    test(s"SPARK-22158 convertMetastore should not ignore table property - $fileFormat") {
-      withSQLConf(CONVERT_METASTORE_ORC.key -> "true", CONVERT_METASTORE_PARQUET.key -> "true") {
-        withTable("t") {
-          withTempPath { path =>
-            sql(
-              s"""
-                |CREATE TABLE t(id int) USING hive
-                |OPTIONS(fileFormat '$fileFormat', compression '$compression')
-                |LOCATION '${path.toURI}'
+  Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name)).foreach {

Review Comment:
   nit: Make `Seq(("orc", "ZLIB"), ("parquet", ParquetCompressionCodec.GZIP.name))` a variable, and then use `seq.foreach { case (fileFormat, compression) =>`. Would the code below need to be reformatted?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1372832472


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:
##########
@@ -857,12 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
 
     // Checks default compression codec
     checkCompressionCodec(
-      CompressionCodecName.fromConf(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
+      ParquetCompressionCodec.fromString(spark.conf.get(SQLConf.PARQUET_COMPRESSION)))
 
-    checkCompressionCodec(CompressionCodecName.UNCOMPRESSED)
-    checkCompressionCodec(CompressionCodecName.GZIP)
-    checkCompressionCodec(CompressionCodecName.SNAPPY)
-    checkCompressionCodec(CompressionCodecName.ZSTD)
+    checkCompressionCodec(ParquetCompressionCodec.UNCOMPRESSED)

Review Comment:
   I tested the other compression codec, the tests failed!
   It seems not supported the others yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43308:
URL: https://github.com/apache/spark/pull/43308#discussion_r1368071970


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecMapper.java:
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.util.Locale;
+
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * A mapper class easy to obtain compression codecs based on their names.
+ */
+public enum ParquetCompressionCodecMapper {
+  NONE(null),
+  UNCOMPRESSED(CompressionCodecName.UNCOMPRESSED),

Review Comment:
   One reason is Spark add the compression codecs `none`, another is out-of-date. Before https://github.com/apache/spark/pull/43310, the parquet supported compression codecs and spark supported are not completely one-on-one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1768504301

   Can we retrigger tests?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45481][SQL] Introduce a mapper for parquet compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #43308:
URL: https://github.com/apache/spark/pull/43308#issuecomment-1772579507

   The GA failure is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org