Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/05/12 12:35:14 UTC

[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

    [ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994990#comment-13994990 ] 

Sean Owen commented on SPARK-1802:
----------------------------------

[~pwendell] You can see my start on it here:

https://github.com/srowen/spark/commits/SPARK-1802
https://github.com/srowen/spark/commit/a856604cfc67cb58146ada01fda6dbbb2515fa00

This resolves the new issues you note in your diff.


The next issue is that hive-exec, quite awfully, bundles a copy of all of its transitive dependencies' classes inside its own artifact. See https://issues.apache.org/jira/browse/HIVE-5733 and note the warnings you'll get during assembly:

{code}
[WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: 
[WARNING]   - org.apache.thrift.transport.TSaslTransport$SaslResponse
...
{code}

hive-exec is in fact used in this module, so as a result you can't control those bundled dependencies, short of actual surgery on the artifact with the shade plugin. This may simply be "the best that can be done" right now; if it has worked so far, it has worked.
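
For the record, that surgery might look roughly like the following filter in the assembly module's maven-shade-plugin configuration. This is only a sketch and untested; the excludes would have to cover everything hive-exec bundles, not just the Thrift packages flagged in the warning above.

{code}
<!-- Sketch: drop hive-exec's bundled copy of the Thrift classes so that only
     libthrift's copy ends up in the assembly. -->
<filter>
  <artifact>org.apache.hive:hive-exec</artifact>
  <excludes>
    <exclude>org/apache/thrift/**</exclude>
  </excludes>
</filter>
{code}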


Am I right that the datanucleus JARs *are* meant to be in the assembly, only for the Hive build?
https://github.com/apache/spark/pull/688
https://github.com/apache/spark/pull/610

If so, that's good, since that's what your diff shows.
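
A quick way to double-check would be to list the assembly built with and without -Phive and grep for datanucleus (the jar path below is just a placeholder for whatever the build actually produces):

{code}
# Sketch: if that's right, datanucleus classes should turn up only in the -Phive assembly.
unzip -l assembly/target/scala-2.10/spark-assembly-*.jar | grep -i datanucleus | head
{code}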


Finally, while we're here, I note that there are still a few JAR conflicts that turn up when you build the assembly *without* Hive. (I'm going to ignore conflicts in the examples module; those can be cleaned up but aren't really a big deal given that module's nature.) We could touch those up too.

This is in the normal build (and I know how to zap most of this problem):
{code}
[WARNING] commons-beanutils-core-1.8.0.jar, commons-beanutils-1.7.0.jar define 82 overlappping classes: 
{code}
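
(The usual zap, for reference: exclude the redundant commons-beanutils-core wherever it enters transitively. A sketch only; commons-configuration is the likely suspect, but the real offender needs confirming with mvn dependency:tree.)

{code}
<!-- Sketch: keep commons-beanutils and drop the overlapping -core artifact
     where it's pulled in transitively. -->
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <exclusions>
    <exclusion>
      <groupId>commons-beanutils</groupId>
      <artifactId>commons-beanutils-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}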

These turn up in the Hadoop 2.x + YARN build:
{code}
[WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: 
...
[WARNING] jcl-over-slf4j-1.7.5.jar, commons-logging-1.1.3.jar define 6 overlappping classes: 
...
[WARNING] activation-1.1.jar, javax.activation-1.1.0.v201105071233.jar define 17 overlappping classes: 
...
[WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: 
{code}

These should be easy to track down. Shall I?
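
If so, I'd probably start with something like this against the Hadoop 2.x + YARN build to see where each conflicting artifact enters the graph (the profile and version flags here are just one example of that build; adjust as needed):

{code}
# Sketch: show which modules pull in the conflicting servlet/logging/activation artifacts.
mvn -Pyarn -Dhadoop.version=2.2.0 dependency:tree \
  -Dincludes=javax.servlet,org.mortbay.jetty,commons-logging,javax.activation
{code}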

> Audit dependency graph when Spark is built with -Phive
> ------------------------------------------------------
>
>                 Key: SPARK-1802
>                 URL: https://issues.apache.org/jira/browse/SPARK-1802
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> I'd like to have the binary release for 1.0 include Hive support. Since this isn't enabled by default in the build, I don't think it's as well tested, so we should dig around a bit and decide if we need to, e.g., add any excludes.
> {code}
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt
> $ diff without_hive.txt with_hive.txt
> < antlr-2.7.7.jar
> < antlr-3.4.jar
> < antlr-runtime-3.4.jar
> 10,14d6
> < avro-1.7.4.jar
> < avro-ipc-1.7.4.jar
> < avro-ipc-1.7.4-tests.jar
> < avro-mapred-1.7.4.jar
> < bonecp-0.7.1.RELEASE.jar
> 22d13
> < commons-cli-1.2.jar
> 25d15
> < commons-compress-1.4.1.jar
> 33,34d22
> < commons-logging-1.1.1.jar
> < commons-logging-api-1.0.4.jar
> 38d25
> < commons-pool-1.5.4.jar
> 46,49d32
> < datanucleus-api-jdo-3.2.1.jar
> < datanucleus-core-3.2.2.jar
> < datanucleus-rdbms-3.2.1.jar
> < derby-10.4.2.0.jar
> 53,57d35
> < hive-common-0.12.0.jar
> < hive-exec-0.12.0.jar
> < hive-metastore-0.12.0.jar
> < hive-serde-0.12.0.jar
> < hive-shims-0.12.0.jar
> 60,61d37
> < httpclient-4.1.3.jar
> < httpcore-4.1.3.jar
> 68d43
> < JavaEWAH-0.3.2.jar
> 73d47
> < javolution-5.5.1.jar
> 76d49
> < jdo-api-3.0.1.jar
> 78d50
> < jetty-6.1.26.jar
> 87d58
> < jetty-util-6.1.26.jar
> 93d63
> < json-20090211.jar
> 98d67
> < jta-1.1.jar
> 103,104d71
> < libfb303-0.9.0.jar
> < libthrift-0.9.0.jar
> 112d78
> < mockito-all-1.8.5.jar
> 136d101
> < servlet-api-2.5-20081211.jar
> 139d103
> < snappy-0.2.jar
> 144d107
> < spark-hive_2.10-1.0.0.jar
> 151d113
> < ST4-4.0.4.jar
> 153d114
> < stringtemplate-3.2.1.jar
> 156d116
> < velocity-1.7.jar
> 158d117
> < xz-1.0.jar
> {code}
> Some initial investigation suggests we may need to take some precautions around (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)