Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2019/09/20 22:39:00 UTC
[jira] [Created] (HUDI-268) Allow parquet/avro versions upgrading in Hudi
Udit Mehrotra created HUDI-268:
----------------------------------
Summary: Allow parquet/avro versions upgrading in Hudi
Key: HUDI-268
URL: https://issues.apache.org/jira/browse/HUDI-268
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Hive Integration, Presto Integration, Usability
Reporter: Udit Mehrotra
As of now, Hudi depends on *Parquet 1.8.1* and *Avro 1.7.7*, which work fine for older versions of Spark and Hive.
But when we build it against *Spark 2.4.3*, which uses *Parquet 1.10.1* and *Avro 1.8.2*, using:
{code:bash}
mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5-amzn-4 -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 -Dparquet.version=1.10.1 -Davro.version=1.8.2
{code}
we run into a runtime issue on *Hive 2.3.5* when querying RT tables:
{noformat}
hive> select record_key, cs_wholesale_cost, cs_ext_sales_price from catalog_sales_part_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
This is happening because we shade *parquet-avro*, which after the upgrade is *1.10.1* and requires *Avro 1.8.2*, the version in which the *LogicalType* class was introduced. However, *Hive 2.3.5* only has *Avro 1.7.7* on its runtime classpath, which does not contain the *LogicalType* class.
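To make the failure mode concrete, here is a minimal, hypothetical probe (not part of Hudi) that checks whether the Avro class that *parquet-avro 1.10.1* links against is visible on the current classpath; the class name and probe are assumptions for illustration only:

{code:java}
// Hypothetical sketch: org.apache.avro.LogicalType exists only in Avro >= 1.8,
// so its absence on the runtime classpath explains the NoClassDefFoundError above.
public class AvroLogicalTypeProbe {

    static boolean hasLogicalType() {
        try {
            Class.forName("org.apache.avro.LogicalType");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasLogicalType()
            ? "Avro >= 1.8 on classpath"
            : "Avro < 1.8 (or missing): parquet-avro 1.10.1 will fail at runtime");
    }
}
{code}

Running this inside the Hive 2.3.5 process would report the class as missing, which matches the stack trace above.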
To avoid these scenarios, and at least allow the use of newer Spark versions without breaking the Hive/Presto integrations, we propose the following:
* Compile Hudi with the Parquet/Avro version used by Spark.
* Shade Avro in *hadoop-mr-bundle* and *presto-bundle* to avoid issues due to older version of Avro being available there.
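For the second point, one possible shape of the fix is a shade-plugin relocation in the bundle POMs; this is only a sketch, and the relocation prefix shown is an assumption, not the actual Hudi configuration:

{code:xml}
<!-- Hypothetical maven-shade-plugin fragment: relocate Avro inside the bundle
     so the bundled Avro 1.8.2 does not clash with Hive's Avro 1.7.7. -->
<relocation>
  <pattern>org.apache.avro</pattern>
  <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
</relocation>
{code}

With a relocation like this, the bundle carries its own copy of Avro under a renamed package, so whatever Avro version Hive or Presto provides at runtime no longer matters.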
This will also help with our other efforts, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)