Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/11/14 21:55:55 UTC

[GitHub] [incubator-hudi] umehrot2 edited a comment on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712
 
 
   @modi95 @bvaradar 
   
   I was able to fix the integration test dependency issues locally, at least. Hoping that things run fine on Travis too. To give an overview, there were 3 major failures:
   
   1. The `ITTestHoodieSanity` tests were failing, firstly because of this error:
   ```
   17:15:31.995 [pool-21-thread-2] ERROR org.apache.hudi.io.HoodieCreateHandle - Error writing record HoodieRecord{key=HoodieKey { recordKey=98ea14b7-b318-4b0b-9f14-0115900a10e0 partitionPath=2016/03/15}, currentLocation='null', newLocation='null'}
   java.lang.NoSuchMethodError: org.apache.parquet.io.api.Binary.fromCharSequence(Ljava/lang/CharSequence;)Lorg/apache/parquet/io/api/Binary;
   	at org.apache.parquet.avro.AvroWriteSupport.fromAvroString(AvroWriteSupport.java:371) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:346) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[hive-exec-2.3.1.jar:1.10.1]
   	at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:288) ~[hive-exec-2.3.1.jar:1.10.1]
   	at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:91) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:101) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:150) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:142) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   ```
   
   This is happening because in Hudi, even for the bits running through Spark, we are using `Hive 2.3.1`, which is not really compatible with Spark. So `hive-exec 2.3.1` ends up on the `HoodieJavaApp` classpath while running the example, and it bundles its own shaded Parquet, which is old and conflicts with `parquet 1.10.1`.
   
   What I propose is that we use a version of Hive that is compatible with Spark, at least for the bits running inside Spark, so that compatible Hive versions end up on the classpath. `hive-exec 1.2.1.spark2` does not cause this issue, as it does not shade Parquet. Also, we have removed Hive shading on master now, so we are anyway dependent on the runtime Hive version, which is Spark's Hive version. So from the code's perspective as well, I think it makes sense to depend on Spark's Hive version for the code that runs inside Spark, to avoid such issues.
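   
   To make the idea concrete, here is a minimal Maven sketch, assuming the Spark-facing modules can simply pin `hive-exec` to the Hive fork that Spark 2.x itself builds against. The property name and placement are illustrative, not the actual Hudi POM layout:
   
   ```
   <!-- Illustrative only: use Spark's Hive fork for the bits running inside Spark, so that
        hive-exec's bundled Parquet classes no longer conflict with parquet 1.10.1. -->
   <properties>
     <!-- hypothetical property name -->
     <spark.hive.version>1.2.1.spark2</spark.hive.version>
   </properties>

   <dependency>
     <groupId>org.spark-project.hive</groupId>
     <artifactId>hive-exec</artifactId>
     <version>${spark.hive.version}</version>
     <scope>provided</scope>
   </dependency>
   ```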
   
   2. After that, all the `_rt` tests in `ITTestHoodieSanity` were failing because our code is now on `Avro 1.8.2` while Hive is still on older versions, so we need to shade Avro in `hudi-hadoop-mr-bundle`, which we had done internally for EMR through an optional profile. Now that we are migrating Hudi itself to Avro 1.8.2, we need to always shade Avro to get around this issue (see the relocation sketch below). More details on https://issues.apache.org/jira/browse/HUDI-268
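   
   For reference, the kind of relocation meant here is roughly the following maven-shade-plugin configuration in `hudi-hadoop-mr-bundle` (the `shadedPattern` prefix is just an example, not necessarily the exact one used):
   
   ```
   <!-- Illustrative sketch: bundle Avro 1.8.2 and relocate it, so the older Avro already on
        Hive's classpath cannot clash with it at query time. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <relocations>
         <relocation>
           <pattern>org.apache.avro.</pattern>
           <!-- example relocation prefix -->
           <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
         </relocation>
       </relocations>
     </configuration>
   </plugin>
   ```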
   
   3. Finally, some tests were failing because `spark-avro` was not being passed while starting the spark-shell, so the classes were not being found. So I switched over to downloading `spark-avro` instead of `databricks-avro` (coordinates below).
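   
   Concretely, that means pulling the Apache artifact rather than the Databricks one; assuming Spark 2.4.4 on Scala 2.11, the coordinates are:
   
   ```
   <!-- Apache spark-avro coordinates, replacing com.databricks:spark-avro_2.11 -->
   <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-avro_2.11</artifactId>
     <version>2.4.4</version>
   </dependency>
   ```
   
   For spark-shell based runs, the same artifact can be supplied via `--packages org.apache.spark:spark-avro_2.11:2.4.4` instead of the `com.databricks:spark-avro` package.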
   
   With the above changes, the integration tests now work. Let me know your thoughts on these changes and whether there are any concerns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services