Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2019/09/20 22:39:00 UTC

[jira] [Created] (HUDI-268) Allow parquet/avro versions upgrading in Hudi

Udit Mehrotra created HUDI-268:
----------------------------------

             Summary: Allow parquet/avro versions upgrading in Hudi
                 Key: HUDI-268
                 URL: https://issues.apache.org/jira/browse/HUDI-268
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Hive Integration, Presto Integration, Usability
            Reporter: Udit Mehrotra


As of now, Hudi depends on *Parquet 1.8.1* and *Avro 1.7.7*, which work fine with older versions of Spark and Hive.

However, when we build Hudi against *Spark 2.4.3*, which uses *Parquet 1.10.1* and *Avro 1.8.2*, using:
{code:java}
mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5-amzn-4 -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 -Dparquet.version=1.10.1 -Davro.version=1.8.2
{code}
We run into a runtime issue on *Hive 2.3.5* when querying real-time (RT) tables:
{noformat}
hive> select record_key, cs_wholesale_cost, cs_ext_sales_price from catalog_sales_part_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
	at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
This happens because we shade *parquet-avro*, which is now *1.10.1* and requires *Avro 1.8.2*, the version that introduced the *LogicalType* class. However, *Hive 2.3.5* only provides *Avro 1.7.7* at runtime, which does not have the *LogicalType* class.
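For anyone debugging a similar mismatch, here is a minimal diagnostic sketch (not part of Hudi; the class name and messages are illustrative) that probes whether the Avro on the runtime classpath provides *LogicalType*:
{code:java}
// Hypothetical diagnostic, not part of Hudi: checks the runtime classpath
// for org.apache.avro.LogicalType, which was introduced in Avro 1.8.
// On Hive 2.3.5 (Avro 1.7.7) this reports the class as missing, matching
// the NoClassDefFoundError above.
public class AvroLogicalTypeCheck {
  public static void main(String[] args) {
    try {
      Class.forName("org.apache.avro.LogicalType");
      System.out.println("org.apache.avro.LogicalType found => Avro >= 1.8 on classpath");
    } catch (ClassNotFoundException e) {
      System.out.println("org.apache.avro.LogicalType missing => Avro < 1.8 on classpath");
    }
  }
}
{code}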

To avoid such scenarios, and at least allow the use of newer Spark versions without breaking the Hive/Presto integrations, we propose the following:
 * Compile Hudi with the Parquet/Avro version used by Spark.
 * Shade Avro in *hadoop-mr-bundle* and *presto-bundle* to avoid issues caused by the older version of Avro available at runtime there (a sketch follows below this list).
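As a rough illustration of the second bullet, a *maven-shade-plugin* relocation along these lines would let the bundles carry their own Avro 1.8.2 classes without clashing with the Avro 1.7.7 that Hive provides at runtime (the relocated package prefix below is only an assumption, not the final bundle configuration):
{code:xml}
<!-- Sketch only: relocate Avro inside the bundle jar so the bundled 1.8.2
     classes cannot collide with the Avro 1.7.7 Hive provides at runtime.
     The shadedPattern prefix is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}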

This will also help with our other efforts, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)