Posted to commits@spark.apache.org by li...@apache.org on 2019/04/19 15:59:25 UTC

[spark] branch master updated: [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 777b450  [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
777b450 is described below

commit 777b4502b206b7240c6655d3c0b0a9ce08f6a09c
Author: Yuming Wang <yu...@ebay.com>
AuthorDate: Fri Apr 19 08:59:08 2019 -0700

    [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
    
    ## What changes were proposed in this pull request?
    
    When we compile and test against Hadoop 3.2, we hit the following two issues:
    1. `JobSummaryLevel` is not a member of object `org.apache.parquet.hadoop.ParquetOutputFormat`. Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381) (Parquet 1.9.0)
    2. `java.lang.NoSuchFieldError: BROTLI`
        at `org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)`. Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143) (Parquet 1.10.0)
    
    The root cause is that `parquet-hadoop-bundle-1.8.1.jar` conflicts with Parquet 1.10.1.
    I think it would be safe to upgrade Hive's Parquet to 1.10.1 to work around this issue.
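    The conflict described above can be reproduced in miniature: two archives on the classpath carrying an entry with the same class path, where the JVM silently loads whichever comes first. A minimal shell sketch (synthetic jars built in a temp directory, not the actual Spark jars):

    ```shell
    # Minimal, self-contained sketch of a classpath conflict: two jars that both
    # bundle the same class entry. The JVM loads whichever comes first, so a stale
    # parquet-hadoop-bundle can shadow classes from parquet-hadoop-1.10.1.
    set -e
    tmp=$(mktemp -d)
    mkdir -p "$tmp/org/apache/parquet/hadoop/metadata"
    touch "$tmp/org/apache/parquet/hadoop/metadata/CompressionCodecName.class"
    (cd "$tmp" && zip -qr old-bundle.jar org && cp old-bundle.jar parquet-hadoop.jar)
    # List .class entries that appear in more than one jar -- these are conflicts:
    for j in "$tmp"/*.jar; do unzip -Z1 "$j"; done | grep '\.class$' | sort | uniq -d
    # -> org/apache/parquet/hadoop/metadata/CompressionCodecName.class
    ```

    The same `uniq -d` scan over the real `jars/` directory of a Spark build is a quick way to confirm that removing the bundle leaves only one copy of each Parquet class.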
    
    This is what Hive did when upgrading Parquet from 1.8.1 to 1.10.0: [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all changes are related to vectors, and vectorization is disabled by default: see [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723).
    
    This PR removes [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle), so the Hive SerDe will use [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189).
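    The mechanism in the pom.xml hunk further down can be sketched as a scope-by-property pattern (element names are taken from this commit; the surrounding `<project>` boilerplate and profile ids are omitted):

    ```xml
    <!-- Default: parquet-hadoop-bundle ships with the Hive dependencies. -->
    <properties>
      <hive.parquet.scope>${hive.deps.scope}</hive.parquet.scope>
    </properties>

    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>${hive.parquet.group}</groupId>
          <artifactId>parquet-hadoop-bundle</artifactId>
          <version>${hive.parquet.version}</version>
          <scope>${hive.parquet.scope}</scope>
        </dependency>
      </dependencies>
    </dependencyManagement>

    <!-- The hadoop-3.2 profile overrides the property, so the bundle is
         visible at compile time only and the parquet-*-1.10.1 classes
         win at runtime. -->
    <profile>
      <properties>
        <hive.parquet.scope>provided</hive.parquet.scope>
      </properties>
    </profile>
    ```

    Keeping the default scope as `${hive.deps.scope}` leaves the other Hadoop profiles untouched; only the hadoop-3.2 profile demotes the bundle to `provided`.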
    
    ## How was this patch tested?
    
    1. Manual tests
    2. [Upgrade Hive Parquet to 1.10.1 and run the Hadoop 3.2 tests on Jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962)
    
    Closes #24346 from wangyum/SPARK-27176.
    
    Authored-by: Yuming Wang <yu...@ebay.com>
    Signed-off-by: gatorsmile <ga...@gmail.com>
---
 dev/deps/spark-deps-hadoop-3.2 | 1 -
 pom.xml                        | 8 +++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2
index a45f02d..8b3bd79 100644
--- a/dev/deps/spark-deps-hadoop-3.2
+++ b/dev/deps/spark-deps-hadoop-3.2
@@ -187,7 +187,6 @@ parquet-common-1.10.1.jar
 parquet-encoding-1.10.1.jar
 parquet-format-2.4.0.jar
 parquet-hadoop-1.10.1.jar
-parquet-hadoop-bundle-1.6.0.jar
 parquet-jackson-1.10.1.jar
 protobuf-java-2.5.0.jar
 py4j-0.10.8.1.jar
diff --git a/pom.xml b/pom.xml
index fce4cbd..5879a76 100644
--- a/pom.xml
+++ b/pom.xml
@@ -221,6 +221,7 @@
     -->
     <hadoop.deps.scope>compile</hadoop.deps.scope>
     <hive.deps.scope>compile</hive.deps.scope>
+    <hive.parquet.scope>${hive.deps.scope}</hive.parquet.scope>
     <orc.deps.scope>compile</orc.deps.scope>
     <parquet.deps.scope>compile</parquet.deps.scope>
     <parquet.test.deps.scope>test</parquet.test.deps.scope>
@@ -2004,7 +2005,7 @@
         <groupId>${hive.parquet.group}</groupId>
         <artifactId>parquet-hadoop-bundle</artifactId>
         <version>${hive.parquet.version}</version>
-        <scope>compile</scope>
+        <scope>${hive.parquet.scope}</scope>
       </dependency>
       <dependency>
         <groupId>org.codehaus.janino</groupId>
@@ -2818,8 +2819,9 @@
         <hive.classifier>core</hive.classifier>
         <hive.version>${hive23.version}</hive.version>
         <hive.version.short>2.3.4</hive.version.short>
-        <hive.parquet.group>org.apache.parquet</hive.parquet.group>
-        <hive.parquet.version>1.8.1</hive.parquet.version>
+        <!-- Do not need parquet-hadoop-bundle because we already have
+          parquet-common, parquet-column and parquet-hadoop -->
+        <hive.parquet.scope>provided</hive.parquet.scope>
         <orc.classifier></orc.classifier>
         <datanucleus-core.version>4.1.17</datanucleus-core.version>
       </properties>


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org