Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/11/19 08:16:10 UTC

[GitHub] [incubator-hudi] n3nash commented on issue #1005: [HUDI-91][HUDI-12] Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

n3nash commented on issue #1005: [HUDI-91][HUDI-12] Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-555386022
 
 
   @umehrot2 Thanks for enumerating your thoughts. Let me add some more context here.
   
   Firstly, hive-exec has a `core` classifier that gives you a dependency-reduced version of the jar. Although that lets us work around the fat-jar problem, the dependency-reduced jar has a problem of its own: it doesn't package some of the transitive dependencies that classes inside it need. There are ways to fix this too, by declaring those relocated dependencies directly in Hudi (@modi95 was trying this at Uber).
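   
   To make that concrete, here is a minimal pom.xml sketch; the version properties and the Calcite add-back are illustrative assumptions on my part, not something this PR pins down:
   
       <!-- Sketch: pull the dependency-reduced hive-exec via the `core` classifier. -->
       <dependency>
         <groupId>org.apache.hive</groupId>
         <artifactId>hive-exec</artifactId>
         <version>${hive.version}</version>
         <classifier>core</classifier>
       </dependency>
       <!-- The core jar omits transitive dependencies its classes still need at
            runtime (Calcite is one known example), so those have to be declared
            back explicitly; exact artifacts and versions would need verification. -->
       <dependency>
         <groupId>org.apache.calcite</groupId>
         <artifactId>calcite-core</artifactId>
         <version>${calcite.version}</version>
       </dependency>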
   
   Secondly, there is no support for Spark's fork of Hive (1.2.1.spark2). The Spark community created this fork to solve exactly the issue I described above, of the hive jars not bundling the correct dependencies; read more here: https://issues.apache.org/jira/browse/HIVE-16391?focusedCommentId=16032497&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16032497. Some more changes were later added to the fork which are NOT necessary, according to the comments in the same JIRA.
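   
   For reference, the fork is published under its own Maven coordinates, so depending on it directly would look roughly like the sketch below (this is not what the PR does today, just an illustration):
   
       <!-- Sketch: Spark's forked Hive, published under org.spark-project.hive. -->
       <dependency>
         <groupId>org.spark-project.hive</groupId>
         <artifactId>hive-exec</artifactId>
         <version>1.2.1.spark2</version>
       </dependency>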
   
   In fact, there is a strong push in the Spark community to move away from this forked version back to regular Hive. See here: http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html.
   
   But I see your point about having the spark modules depend on the Hive version that spark-hive pulls in; that way the choice is explicit and we don't have to solve this issue ourselves.
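   
   In pom.xml terms, that would look something like the sketch below: the spark modules declare spark-hive and inherit whatever Hive it bundles, instead of pinning hive-exec themselves (the scope here is my assumption):
   
       <!-- Sketch: let spark-hive decide which Hive the spark modules see. -->
       <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-hive_2.11</artifactId>
         <version>2.4.4</version>
         <scope>provided</scope>
       </dependency>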
   
   I have a few hesitations about introducing Spark's forked Hive version: a) it means we have 2 Hive versions across the project; b) Spark's fork of Hive doesn't offer anything beyond solving the hive-exec jar mess.
   I'm actually okay with (b). @vinothchandar @bvaradar If you're okay with (a) and don't see any issues, I'm fine with taking this approach. Personally, I don't have much foresight into the future side effects of carrying different Hive versions across the project.
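   
   To spell out hesitation (a): the build would end up carrying two Hive version properties side by side, roughly like this hypothetical fragment (property names and versions are made up for illustration):
   
       <properties>
         <hive.version>2.3.1</hive.version>                    <!-- regular Hive -->
         <spark.hive.version>1.2.1.spark2</spark.hive.version> <!-- Spark's fork, for the spark modules -->
       </properties>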
   
   
