Posted to issues@hive.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2017/06/01 05:56:04 UTC

[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

    [ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032497#comment-16032497 ] 

Josh Rosen commented on HIVE-16391:
-----------------------------------

I tried to see whether Spark can consume the existing Hive 1.2.1 artifacts, but it looks like neither the regular nor the {{core}}-classified {{hive-exec}} artifact can work:

* We can't use the regular Hive uber-JAR artifacts because they bundle many transitive dependencies without relocating those dependencies' classes into a private namespace, which puts multiple versions of the same classes on the classpath. To see this, note the long list of bundled artifacts at https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml#L685 versus the single relocation pattern (for Kryo); the sketch after this list illustrates the difference between bundling and relocating.
* We can't use the {{core}}-classified artifact:
** We actually need Kryo to be shaded in {{hive-exec}} because Spark now uses Kryo 3 (which is needed by Chill 0.8.x, which is needed for Scala 2.12) while Hive uses Kryo 2.
** In addition, I think that Spark needs to shade Hive's {{com.google.protobuf:protobuf-java}} dependency.
** The published {{hive-exec}} POM is a "dependency-reduced" POM which doesn't declare {{hive-exec}}'s transitive dependencies. To see this, compare the declared dependencies in the published POM in Maven Central (http://central.maven.org/maven2/org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.pom) to the dependencies in the source repo's POM: https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml. The lack of declared dependencies creates an additional layer of pain when consuming the {{core}} JAR because we then have to shoulder the burden of declaring explicit dependencies on {{hive-exec}}'s transitive dependencies (since they're no longer bundled in an uber JAR when we use the {{core}} JAR), making it harder to use tools like Maven's {{dependency:tree}} to help us spot potential dep. conflicts.
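
To make the bundle-vs-relocate distinction concrete: a shade-plugin {{<include>}} merely copies a dependency's classes into the uber JAR under their original names, while a {{<relocation>}} also rewrites the package names (and every bytecode reference to them) into a private namespace. A minimal sketch, with illustrative coordinates and patterns rather than Hive's exact config:

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <!-- Bundled but NOT relocated: these classes keep their original
             names and will clash with any other copy on the classpath. -->
        <include>com.google.protobuf:protobuf-java</include>
        <!-- Bundled AND relocated by the rule below. -->
        <include>com.esotericsoftware.kryo:kryo</include>
      </includes>
    </artifactSet>
    <relocations>
      <!-- Rewrites com.esotericsoftware.* into a private namespace, so the
           bundled Kryo 2 can't conflict with the consumer's Kryo 3. -->
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
{code}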

Spark's current custom Hive fork effectively makes three changes relative to Hive 1.2.1 in order to work around the above problems, plus some legacy issues which are no longer relevant:

* Remove the shading/bundling of most non-Hive classes, with the exception of Kryo and Protobuf (roughly sketched after this list). This has the effect of making the published POM non-dependency-reduced, easing the dep. management story in Spark's POMs, while still ensuring that we relocate the classes that conflict with Spark.
* Package the hive-shims into the hive-exec JAR. I don't think that this is strictly necessary.
* Downgrade Kryo to 2.21. This isn't necessary anymore: there was an earlier time when we purposely _unshaded_ Kryo and pinned Hive's version to match Spark's. The only reason this change is still present today is that it minimized the diff between versions 1 and 2 of Spark's Hive fork.
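
In shade-plugin terms, the first change roughly amounts to shrinking the bundled {{artifactSet}} to just the two conflicting libraries and relocating both; everything else stays a plain declared dependency, so the dependency-reduced POM that the shade plugin generates still lists it. This is an approximation of the fork's config, not a verbatim excerpt (see the diff linked below for the real thing):

{code:xml}
<configuration>
  <artifactSet>
    <includes>
      <!-- Bundle only the two libraries whose classes must be hidden
           from Spark; all other deps remain declared in the POM. -->
      <include>com.esotericsoftware.kryo:kryo</include>
      <include>com.google.protobuf:protobuf-java</include>
    </includes>
  </artifactSet>
  <relocations>
    <relocation>
      <pattern>com.esotericsoftware</pattern>
      <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
    </relocation>
    <relocation>
      <pattern>com.google.protobuf</pattern>
      <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
    </relocation>
  </relocations>
</configuration>
{code}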

For the full details, see https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2, which compares the current Version 2 of our Hive fork to stock Hive 1.2.1.

Maven does not allow an artifact to declare different dependencies per classifier (all of an artifact's classifiers share a single POM), so if we wanted to publish a {{hive-exec core}}-like artifact which declares its transitive dependencies then this would need to be done under a new Maven artifact name or a new version (e.g. Hive 1.2.2-spark).

That said, proper declaration of transitive dependencies isn't a hard blocker for us: a long, long, long time ago, I think that Spark may have actually built with a stock {{core}} artifact and explicitly declared the transitive deps, so if we've handled that dependency declaration before then we can do it again at the cost of some pain in the future if we want to bump to Hive 2.x.
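
As a rough sketch of what that consumer-side workaround looks like: because the classifier only selects a different JAR file while Maven still resolves transitive deps from the single dependency-reduced {{hive-exec}} POM, every omitted dependency has to be re-declared by hand. The listed dependencies are an illustrative subset, not the full set:

{code:xml}
<dependencies>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
    <!-- Picks hive-exec-1.2.1-core.jar, but the dependency list still
         comes from the one (dependency-reduced) hive-exec POM. -->
    <classifier>core</classifier>
  </dependency>
  <!-- Transitive deps that the reduced POM omits, re-declared by hand
       (an illustrative subset): -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-shims</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.6</version>
  </dependency>
</dependencies>
{code}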

Therefore, I think the minimal change needed in Hive's build is to add a new classifier, say {{core-spark}}, which behaves like {{core}} except that it shades and relocates Kryo and Protobuf. If this artifact existed then I think Spark could use that classified artifact, declare an explicit dependency on the shim artifacts (assuming Kryo and Protobuf don't need to be shaded there) and explicitly pull in all of {{hive-exec}}'s transitive dependencies. This avoids the need to publish separate _versions_ for Spark: instead, Spark would just consume a differently-packaged/differently-classified version of a stock Hive release.
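
Mechanically, the new classifier could be produced by an extra shade execution in {{ql/pom.xml}} that attaches its output under the {{core-spark}} classifier and applies the same two relocations sketched above. The names here are assumptions for illustration, not a tested patch:

{code:xml}
<execution>
  <id>core-spark-shade</id>
  <phase>package</phase>
  <goals>
    <goal>shade</goal>
  </goals>
  <configuration>
    <!-- Attach the output as hive-exec-<version>-core-spark.jar alongside
         the existing artifacts instead of replacing the main JAR. -->
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <shadedClassifierName>core-spark</shadedClassifierName>
    <artifactSet>
      <includes>
        <!-- Bundle only the two libraries that must be relocated. -->
        <include>com.esotericsoftware.kryo:kryo</include>
        <include>com.google.protobuf:protobuf-java</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</execution>
{code}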

If we go with this classifier approach, then I guess Hive would need to publish a 1.2.3 or 1.2.2.1 release in order to introduce the new classified artifact.

Does this sound like a reasonable approach? Or would it make more sense to have a separate Hive branch and versioning scheme for Spark (e.g. {{branch-1.2-spark}} and Hive {{1.2.1-spark}})? I lean towards the former approach (releasing 1.2.3 with an additional Spark-specific classifier), especially if we want to fix bugs or make functional / non-packaging changes later down the road (I think [~stevel@apache.org] had a few changes / fixes he wanted to make).

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16391
>                 URL: https://issues.apache.org/jira/browse/HIVE-16391
>             Project: Hive
>          Issue Type: Task
>          Components: Build Infrastructure
>            Reporter: Reynold Xin
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the only change in the fork is to work around the issue that Hive publishes only two sets of jars: one set with no dependencies declared, and another with all the dependencies included in the published uber jar. That is to say, Hive doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked Hive.
> The change in the forked version is recorded here: https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)