Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2019/09/20 23:11:00 UTC

[jira] [Commented] (HUDI-268) Allow parquet/avro versions upgrading in Hudi

    [ https://issues.apache.org/jira/browse/HUDI-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934817#comment-16934817 ] 

Udit Mehrotra commented on HUDI-268:
------------------------------------

Created a PR for the same: https://github.com/apache/incubator-hudi/pull/915

> Allow parquet/avro versions upgrading in Hudi
> ---------------------------------------------
>
>                 Key: HUDI-268
>                 URL: https://issues.apache.org/jira/browse/HUDI-268
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Hive Integration, Usability
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>
> As of now, Hudi depends on *Parquet 1.8.1* and *Avro 1.7.7*, which might work fine for older versions of Spark and Hive.
> But when we build it against *Spark 2.4.3*, which uses *Parquet 1.10.1* and *Avro 1.8.2*, using:
> {code:bash}
> mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5 -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 -Dparquet.version=1.10.1 -Davro.version=1.8.2
> {code}
> We run into a runtime issue on *Hive 2.3.5* when querying RT tables:
> {noformat}
> hive> select record_key from mytable_mor_sep20_01_rt limit 10;
> OK
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
> 	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
> 	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
> 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
> 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
> 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
> 	at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
> 	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
> 	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> 	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
> 	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
> 	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> 	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
> This happens because we shade *parquet-avro*, which is now *1.10.1* and requires *Avro 1.8.2*, the release that introduced the *LogicalType* class. However, *Hive 2.3.5* has *Avro 1.7.7* on its runtime classpath, which does not have the *LogicalType* class.
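> For illustration, here is a minimal sketch (class name hypothetical) of the incompatibility: this compiles fine against *Avro 1.8.2*, but the first reference to *LogicalType* fails with the same *NoClassDefFoundError* when only *Avro 1.7.7* is on the runtime classpath:
> {code:java}
> import org.apache.avro.LogicalType;   // introduced in Avro 1.8; absent from Avro 1.7.7
> import org.apache.avro.LogicalTypes;
> import org.apache.avro.Schema;
>
> public class LogicalTypeCheck {
>   public static void main(String[] args) {
>     // Compiles against Avro 1.8.2; with Avro 1.7.7 on the runtime classpath
>     // this line throws java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
>     LogicalType tsMillis = LogicalTypes.timestampMillis();
>     Schema schema = tsMillis.addToSchema(Schema.create(Schema.Type.LONG));
>     System.out.println(schema.toString(true));
>   }
> }
> {code}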
> To avoid these scenarios, and at least allow usage of higher versions of Spark without affecting the Hive integration, we propose the following:
>  * Always compile Hudi with the Parquet/Avro versions used by Spark.
>  * Shade Avro in *hadoop-mr-bundle* to avoid issues due to the older version of Avro being available there (a rough sketch follows below).
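> As a sketch of what that shading could look like (the include list and shaded prefix here are illustrative, not final), the *hadoop-mr-bundle* pom could relocate Avro via the maven-shade-plugin so the bundle carries its own Avro classes instead of resolving Hive's *Avro 1.7.7*:
> {code:xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-shade-plugin</artifactId>
>   <executions>
>     <execution>
>       <phase>package</phase>
>       <goals><goal>shade</goal></goals>
>       <configuration>
>         <artifactSet>
>           <includes>
>             <!-- bundle Avro itself, not just parquet-avro -->
>             <include>org.apache.avro:avro</include>
>           </includes>
>         </artifactSet>
>         <relocations>
>           <relocation>
>             <!-- hypothetical shaded prefix, for illustration only -->
>             <pattern>org.apache.avro</pattern>
>             <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
>           </relocation>
>         </relocations>
>       </configuration>
>     </execution>
>   </executions>
> </plugin>
> {code}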
> This will also help with our other issues, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)