Posted to reviews@spark.apache.org by aarondav <gi...@git.apache.org> on 2014/03/26 08:36:20 UTC
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/237
[WIP] SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging
Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported pathway would silently start using Hive jars.
This patch has the following features/bug fixes:
- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)
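Under this scheme, the two build paths would look roughly as follows (a sketch: the exact invocations are assumptions based on the description above, not verified against the merged build scripts):

```shell
# sbt: opt in to Hive at assembly time via the environment variable
SPARK_HIVE=true sbt/sbt assembly

# Maven: the analogous opt-in via the new hive profile
mvn -Phive -DskipTests clean package
```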
Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set, similar to how sbt downloads itself.
- Spark SQL documentation updates.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/237.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #237
----
commit d09b3a0dea86a8a16e7901af3b3a801d47515c6b
Author: Aaron Davidson <aa...@databricks.com>
Date: 2014-03-13T18:26:28Z
[WIP] Use SPARK_HIVE to determine if we include Hive in packaging
Previously, we based our decision regarding including datanucleus jars
based on the existence of a spark-hive-assembly jar, which was incidentally
built whenever "sbt assembly" is run. This means that a typical and
previously supported pathway would start using hive jars.
This patch has the following features/bug fixes:
- Use of SPARK_HIVE (default false) to determine if we should include Hive
in the assembly jar.
- Analogous feature in Maven with -Phive.
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have
spark-class download them if SPARK_HIVE is set similar to how sbt downloads
itself.
- Spark SQL documentation updates.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11324595
--- Diff: bin/compute-classpath.sh ---
@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
- DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
+ DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
--- End diff --
That is correct. Maven doesn't build assemble-deps right now, so it's fine to keep this sbt-specific. It raises the question of if/how we should add assemble-deps to the Maven build, but that's a whole different issue.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38881838
Can one of the admins verify this patch?
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11324292
--- Diff: bin/compute-classpath.sh ---
@@ -71,6 +57,23 @@ else
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
fi
+# When Hive support is needed, Datanucleus jars must be included on the classpath.
+# Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
+# Both sbt and maven will populate "lib_managed/jars/" with the datanucleus jars when Spark is
+# built with Hive, so first check if the datanucleus jars exist, and then ensure the current Spark
+# assembly is built for Hive, before actually populating the CLASSPATH with the jars.
+# Note that this check order is faster (by up to half a second) in the case where Hive is not used.
+num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ | grep "datanucleus-.*\\.jar" | wc -l)
+if [ $num_datanucleus_jars -gt 0 ]; then
+ AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
+ num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
--- End diff --
also an extra space after `tvf` unless it's intentional?
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39655971
Merged build finished. All automated tests passed.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39654636
Build started.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39659222
I tested this in maven and sbt builds. Seemed to work for me!
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39654659
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13809/
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38718243
Ah, sorry, I didn't read your description. Indeed, this wouldn't work for Maven. Having a file flag seems potentially complicated if someone does multiple builds, some with and some without Hive.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r10988066
--- Diff: sql/hive/pom.xml ---
@@ -53,6 +53,12 @@
<version>${hive.version}</version>
</dependency>
<dependency>
+ <!-- Matches the version of jackson-core-asl pulled in by avro -->
+ <groupId>org.codehaus.jackson</groupId>
+ <artifactId>jackson-mapper-asl</artifactId>
+ <version>1.8.8</version>
--- End diff --
Should the version number be declared in the parent pom and then only the dependency added here? By the way, one reason this might be getting messed up is that we intentionally exclude Jackson from a bunch of dependencies, so maybe we are actually excluding it when pulling in Avro.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39688763
Thanks a lot for looking at this Aaron, I've merged it.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39674358
Merged build started.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39675356
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13821/
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39655244
Also fixes SPARK-1309.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38658678
Merged build triggered.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38656860
@marmbrus @pwendell Please take a look, my build-foo is not strong.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39655972
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13811/
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38715665
One other option is to have a file flag like we do for `RELEASE` that is used in addition to the env var.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38662413
Merged build finished.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r10966616
--- Diff: pom.xml ---
@@ -373,7 +373,6 @@
<groupId>org.apache.derby</groupId>
<artifactId>derby</artifactId>
<version>10.4.2.0</version>
- <scope>test</scope>
--- End diff --
Hive requires derby in the compile scope via a transitive dependency on hive-metastore, and this setting was overriding that. This does not seem to be pulled in from non-hive assembly jars.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38658679
Merged build started.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39674190
Addressed comments and removed WIP.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39677875
LGTM
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39655300
Merged build started.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11324287
--- Diff: bin/compute-classpath.sh ---
@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
- DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
+ DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
--- End diff --
Are you sure this is valid? I seem to remember the Maven and sbt builds name the assembly jar differently(?). If this is the right way to do it, would it make sense to make the check in the `if` statement consistent with this?
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11323799
--- Diff: project/SparkBuild.scala ---
@@ -395,8 +404,6 @@ object SparkBuild extends Build {
// assembly jar.
def hiveSettings = sharedSettings ++ assemblyProjSettings ++ Seq(
--- End diff --
Good catch, removed!
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39674350
Merged build triggered.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38716186
One issue with the "including datanucleus if it exists" approach is that we need to download datanucleus for the user at some point. Right now sbt does that via lib_managed, but there's no pathway to getting the datanucleus jars from Maven, so we may need an external script that's run in spark-class. This would require the same type of runtime option as we have now, or a file flag.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38662415
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13466/
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11324565
--- Diff: bin/compute-classpath.sh ---
@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
- DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
+ DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
--- End diff --
maven doesn't build assemble-deps, as far as I know
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38715156
Thanks Aaron - looks good. From what I can tell, you've just copied the model used for ganglia and, previously, YARN.
The only ugliness is that SPARK_HIVE is now needed at runtime to determine whether to include the Datanucleus jars, and this is conflated with the compile-time setting, which has a different meaning. This is a bit unfortunate - is there no better way here? We could have the assembly name be different if Hive is included, but that's sort of ugly too. We could also just include the datanucleus jars on the classpath whenever they are present. The only downside is that if someone did a build for Hive and then did a normal build, the jars would still be included. But in that case maybe it's just okay to include them - it's not a widely used library.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39655298
Merged build triggered.
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11327251
--- Diff: bin/compute-classpath.sh ---
@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
- DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
+ DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
--- End diff --
It may actually not be too hard to add assemble-deps to Maven if we have a build that inherits from "assembly" and simply excludes Spark's groupId. Though, packaging the Maven assembly is roughly 5x faster than SBT's.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39654566
Alright, I changed up the solution by having Maven also download the datanucleus jars into lib_managed/jars/ (unlike SBT, Maven will only download these jars, and only when building Spark with Hive). Now we will include the datanucleus jars on the classpath if they're present in lib_managed and the Spark assembly includes Hive (checked via the "jar" command).
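The two-stage check described here can be sketched as a pair of small shell helpers (the function names are hypothetical; the real logic lives inline in bin/compute-classpath.sh, and the exact filename and class-path patterns are assumptions taken from this thread):

```shell
#!/bin/sh
# Sketch of the runtime check described above; compute-classpath.sh
# does this inline rather than via functions.

# Count datanucleus jars in a lib_managed/jars directory (0 if the
# directory does not exist).
datanucleus_jar_count() {
  ls "$1" 2>/dev/null | grep -c 'datanucleus-.*\.jar'
}

# Succeed (exit 0) only if the given assembly jar contains Hive classes.
assembly_has_hive() {
  jar tf "$1" 2>/dev/null | grep -q 'org/apache/hadoop/hive/ql/exec'
}

# Check the cheap directory listing first, and only then shell out to
# `jar`; the comment in the diff notes this ordering is faster when
# Hive is not used.
maybe_add_datanucleus() {
  jars_dir="$1"
  assembly_jar="$2"
  if [ "$(datanucleus_jar_count "$jars_dir")" -gt 0 ] \
      && assembly_has_hive "$assembly_jar"; then
    for j in "$jars_dir"/datanucleus-*.jar; do
      CLASSPATH="$CLASSPATH:$j"
    done
  fi
}
```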
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/237
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r11324290
--- Diff: bin/compute-classpath.sh ---
@@ -71,6 +57,23 @@ else
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
fi
+# When Hive support is needed, Datanucleus jars must be included on the classpath.
+# Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
--- End diff --
extra space "the uberjar"
---
[GitHub] spark pull request: SPARK-1314: Use SPARK_HIVE to determine if we ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39675354
Merged build finished. All automated tests passed.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39654658
Build finished.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-39654631
Build triggered.
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/237#discussion_r10996156
--- Diff: project/SparkBuild.scala ---
@@ -395,8 +404,6 @@ object SparkBuild extends Build {
// assembly jar.
def hiveSettings = sharedSettings ++ assemblyProjSettings ++ Seq(
--- End diff --
Any reason we still need a separate hive assembly target? If hive will be a part of the normal assembly jar (based on maybeHive), then these rules should be moved into assemblyProjSettings?
---
[GitHub] spark pull request: [WIP] SPARK-1314: Use SPARK_HIVE to determine ...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/237#issuecomment-38715395
We could also include them and log a message like "Including Datanucleus jars needed for Hive", and then if someone happens not to want them (they previously did a Hive assembly), they'd know what was going on.
---