You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/11/01 12:59:39 UTC

[PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

beliefer opened a new pull request, #43620:
URL: https://github.com/apache/spark/pull/43620

   ### What changes were proposed in this pull request?
   Currently, Spark supported partial Hadoop compression codecs, but the Hadoop supported compression codecs and spark supported are not completely one-on-one due to Spark introduce two fake compression codecs none and uncompress.
   There are a lot of magic strings copy from Hadoop compression codecs. This issue lead to developers need to manually maintain its consistency. It is easy to make mistakes and reduce development efficiency.
   
   
   ### Why are the changes needed?
   Let developers easy to use Hadoop compression codecs.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   Introduce a new class.
   
   
   ### How was this patch tested?
   Exists test cases.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1380368351


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   it's recommended to eliminate the `()` on calling Java no-arg method which has no side-effects
   ```suggestion
           .option("compression", GZIP.lowerCaseName).json(path.getCanonicalPath)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381132422


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   @pan3793 Could these cases possibly manifest as compilation warnings? Or are they merely suggestions for best practices?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1380243958


##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java:
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.DeflateCodec;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+
+/**
+ * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs.
+ */
+public enum HadoopCompressionCodec {
+  NONE(null),
+  UNCOMPRESSED(null),
+  BZIP2(new BZip2Codec()),
+  DEFLATE(new DeflateCodec()),
+  GZIP(new GzipCodec()),
+  LZ4(new Lz4Codec()),
+  SNAPPY(new SnappyCodec());
+
+  // TODO supports ZStandardCodec
+
+  private final CompressionCodec compressionCodec;
+
+  HadoopCompressionCodec(CompressionCodec compressionCodec) {
+    this.compressionCodec = compressionCodec;
+  }
+
+  public CompressionCodec getCompressionCodec() {
+    return this.compressionCodec;
+  }
+
+  private static final Map<String, String> codecNameMap =
+    Arrays.stream(HadoopCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));

Review Comment:
   ```suggestion
         Collectors.toMap(Enum::name, codec -> codec.name().toLowerCase(Locale.ROOT)));
   ```



##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java:
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.DeflateCodec;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+
+/**
+ * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs.
+ */
+public enum HadoopCompressionCodec {
+  NONE(null),
+  UNCOMPRESSED(null),
+  BZIP2(new BZip2Codec()),
+  DEFLATE(new DeflateCodec()),
+  GZIP(new GzipCodec()),
+  LZ4(new Lz4Codec()),
+  SNAPPY(new SnappyCodec());
+
+  // TODO supports ZStandardCodec
+
+  private final CompressionCodec compressionCodec;
+
+  HadoopCompressionCodec(CompressionCodec compressionCodec) {
+    this.compressionCodec = compressionCodec;
+  }
+
+  public CompressionCodec getCompressionCodec() {
+    return this.compressionCodec;
+  }
+
+  private static final Map<String, String> codecNameMap =
+    Arrays.stream(HadoopCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));

Review Comment:
   nit: we can use method reference.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:
##########
@@ -21,19 +21,15 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._
 
 import org.apache.spark.util.Utils
 
 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
   ```suggestion
       codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1790151607

   ping @dongjoon-hyun @srowen @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381028768


##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java:
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.DeflateCodec;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+
+/**
+ * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs.
+ */
+public enum HadoopCompressionCodec {
+  NONE(null),
+  UNCOMPRESSED(null),
+  BZIP2(new BZip2Codec()),
+  DEFLATE(new DeflateCodec()),
+  GZIP(new GzipCodec()),
+  LZ4(new Lz4Codec()),
+  SNAPPY(new SnappyCodec());
+
+  // TODO supports ZStandardCodec
+
+  private final CompressionCodec compressionCodec;
+
+  HadoopCompressionCodec(CompressionCodec compressionCodec) {
+    this.compressionCodec = compressionCodec;
+  }
+
+  public CompressionCodec getCompressionCodec() {
+    return this.compressionCodec;
+  }
+
+  private static final Map<String, String> codecNameMap =
+    Arrays.stream(HadoopCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));

Review Comment:
   I feel like both are OK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1382330582


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   According to the discussion offline between @cloud-fan and me. This change no need here.
   Spark will add the compilation warnings if Java no-arg method called without ().



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1790151803

   The GA failure is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381192396


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   It's a best practice, no compile warnings, because the compiler has no idea will this method have side effects.
   
   Suggestion about [uniform access principle](https://en.wikipedia.org/wiki/Uniform_access_principle) from [Databricks Scala Guide](https://github.com/databricks/scala-style-guide#parentheses) and IntelliJ IDEA
   
   > Reports calls to Java accessor methods with empty argument clauses.
   >
   > Methods that follow the JavaBean naming contract for accessors are expected to have no side effects. The recommended convention is to use a parameterless method whenever there are no parameters and the method have no side effect. This convention promotes the [uniform access principle](https://en.wikipedia.org/wiki/Uniform_access_principle), which says that the client code should not be affected by the decision to implement an attribute as a field or method.
   > The problem is that Java does not implement the uniform access principle. To bridge that gap, Scala allows you to leave off the empty parentheses on an invocation of a Java method that takes no arguments.
   > 
   > The quick-fix removes the empty argument clause.
   > Example:
   > ```
   > "test".getClass()
   > ```
   > After the quick-fix is applied:
   > ```
   > "test".getClass
   > ```
   
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:
##########
@@ -21,19 +21,15 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._
 
 import org.apache.spark.util.Utils
 
 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
   ```suggestion
       val className =
         if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
       codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381193392


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   > Because there are already a lot of Java no-arg method called with ().
   
   @beliefer the key point is calling of "no-arg method **without side effects**"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1794471567

   Merged to master.
   @LuciferYang @pan3793 Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381041537


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:
##########
@@ -21,19 +21,15 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._
 
 import org.apache.spark.util.Utils
 
 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
   Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43620:
URL: https://github.com/apache/spark/pull/43620#discussion_r1381060163


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
         // The data set has 2 partitions, so Spark will write at least 2 json files.
         // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
         // partitions.
-        .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+        .write.partitionBy("p")
+        .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment:
   Because there are already a lot of Java no-arg method called with ().
   Let's update them if we really have the strong requirements.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43620:
URL: https://github.com/apache/spark/pull/43620#issuecomment-1793707313

   The GA failure is unrelated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer closed pull request #43620: [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs
URL: https://github.com/apache/spark/pull/43620


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org