Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2014/05/04 11:25:05 UTC

[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/629

    SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions

    See related discussion at https://github.com/apache/spark/pull/468
    
    This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.
    
    - Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
    - Removes `hadoop.major.version`
    - Introduces `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
      - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
      - like the jets3t version issue now
    - Hadoop profiles set an appropriate default `hadoop.version`, which can be overridden
    - _(YARN profiles in the parent now only exist to add the sub-module)_
    - Fixes the jets3t dependency issue
      - and makes it a runtime dependency
      - and centralizes its config in the parent pom
    - Updates build docs
    - Updates SBT build too
      - and fixes a regex problem along the way
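
    For illustration only (this is not code from the patch): the version rule behind the jets3t fix can be sketched as a small Java helper. The class and method names are hypothetical. `0.9.0` for Hadoop 2.3+ is the version this PR uses; `0.7.1` as the fallback for older Hadoop versions, and treating any major version above 2 as newer, are assumptions.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Jets3tVersionRule {
        // Matches the leading "major.minor" of versions such as "2.3.0" or "2.3.0-cdh5.0.0".
        private static final Pattern MAJOR_MINOR = Pattern.compile("^(\\d+)\\.(\\d+)");

        // Returns the jets3t version to pair with the given Hadoop version.
        // "0.9.0" for Hadoop 2.3+ comes from this PR; "0.7.1" is an assumed default.
        public static String jets3tVersionFor(String hadoopVersion) {
            Matcher m = MAJOR_MINOR.matcher(hadoopVersion);
            if (!m.find()) {
                throw new IllegalArgumentException("Unrecognized Hadoop version: " + hadoopVersion);
            }
            int major = Integer.parseInt(m.group(1));
            int minor = Integer.parseInt(m.group(2));
            // Compare numerically so that 2.10 counts as newer than 2.3.
            boolean atLeast23 = major > 2 || (major == 2 && minor >= 3);
            return atLeast23 ? "0.9.0" : "0.7.1";
        }

        public static void main(String[] args) {
            for (String v : new String[] {"1.0.4", "0.23.7", "2.2.0", "2.3.0-cdh5.0.0", "2.10.0"}) {
                System.out.println(v + " -> jets3t " + jets3tVersionFor(v));
            }
        }
    }
    ```

    Comparing the parsed major and minor numbers, rather than matching digits with a character class, keeps a version such as `2.10.0` on the newer side of the rule.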

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-1556

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #629
    
----
commit f21f35651dc8e9b2036f5e487f7465d844d35c72
Author: Sean Owen <so...@cloudera.com>
Date:   2014-05-04T09:08:15Z

    Build changes to set up for jets3t fix

commit bbed8262fbaaad8037781e0e32b5141774250839
Author: Sean Owen <so...@cloudera.com>
Date:   2014-05-04T09:15:30Z

    Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)

commit 274f4f989a3fd27aba2ede75d3c7b713547aab68
Author: Sean Owen <so...@cloudera.com>
Date:   2014-05-04T09:21:58Z

    Make jets3t a runtime dependency, and bring its exclusion up into parent config

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42128538
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12266937
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +54,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
    --- End diff --
    
    I think it might be better to always ask people to specify `hadoop.version` and then just explain that in some cases they need to add a profile to work around problems in the Hadoop dependency graph. Otherwise we sometimes rely on the profile to set `hadoop.version`, and it could be a bit confusing to users what is going on.
    
    ```
    mvn -Pyarn -Dhadoop.version=2.2.X -Phadoop-2.2 -DskipTests clean package
    ```
    
    The header here says "Apache Hadoop 2.2.X", but the actual example can't be directly generalized to 2.2.X without users digging around the build more.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12266975
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +54,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
    +
    +    # Apache Hadoop 2.3.X and newer
    +    $ mvn -Pyarn -Phadoop-2.3 -DskipTests clean package
    +
    +    # Apache Hadoop 2.4.X as a custom version
    +    $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.4.0 -DskipTests clean package
    --- End diff --
    
    For instance, right now we suggest using this for all hadoop-2.3+, but who knows if Hadoop will change its dep graph in the future such that those builds don't actually work. It might be better to just create individual profiles for the ones we know we currently support via this workaround.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12261171
  
    --- Diff: core/pom.xml ---
    @@ -38,12 +38,6 @@
         <dependency>
           <groupId>net.java.dev.jets3t</groupId>
           <artifactId>jets3t</artifactId>
    -      <exclusions>
    -        <exclusion>
    -          <groupId>commons-logging</groupId>
    -          <artifactId>commons-logging</artifactId>
    -        </exclusion>
    -      </exclusions>
    --- End diff --
    
    It is only moved up to the parent where similar config resides.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42175676
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42175679
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14662/



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42173644
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42171989
  
    OK, but leave default `hadoop.version=1.0.4`? OK. I can make a `hadoop-2.4` profile if you're not concerned about having the extra stanza to maintain.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/629



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42132419
  
    Looks good to me.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by darose <gi...@git.apache.org>.
Github user darose commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42672273
  
    Thanks for the suggestions.  I don't think this is a deployment issue though.  I don't have any spark/shark remnants installed from packages on the client machine.  (I don't even have an /opt/cloudera directory - my Cloudera packages seem to get installed under /usr/lib/hadoop, /usr/lib/hive, etc.).  Rather, I was manually deploying the binaries I built to /usr/lib/spark and /usr/lib/shark, and I've been completely removing those directory trees each time I do a new build.
    
    And similarly on the Hadoop cluster machines: these are Amazon EC2 AMIs that I'm building off a fresh, pristine Ubuntu 13.10 base, so there are no spark/shark remnants present before I start.
    
    So I'm fairly certain this is an issue of me building incorrectly.  I think what I did to build was:
    * grab the current master branch of spark & shark
    * update SparkBuild.scala to downgrade the version from 1.0-SNAPSHOT to 0.9.1
    * update SharkBuild.scala to use jets3t 0.9.0
    * build both with sbt assembly and sbt package
    * when done, copy the versions of spark-core_2.10-0.9.1.jar, spark-bagel_2.10-0.9.1.jar, spark-mllib_2.10-0.9.1.jar, and spark-repl_2.10-0.9.1.jar that were generated during the spark build and use them to replace the corresponding jars in lib_managed in the shark build.  (My thinking here was that the shark build was pulling those jars from maven, and that perhaps the "class incompatible" issue was being caused by spark and shark using different versions of those jars.)
    
    But the result was the json4s issue I posted above.
    
    In any case, as this is a build/deployment issue, probably best for me to take this off GitHub.  Would be very grateful, though, if you might be able to assist in getting a working spark/shark build for us.  Our company is a Cloudera support customer, so I'll try following up through those channels.  If you don't mind, I'll suggest that the support team get in touch with you about this, as you're obviously the most well-versed on the issue.




[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42213788
  
    All set here. I think it does fix the issue and makes future, similar changes easier to apply in Maven, so I'm for it.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42173066
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12266958
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +54,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
    +
    +    # Apache Hadoop 2.3.X and newer
    +    $ mvn -Pyarn -Phadoop-2.3 -DskipTests clean package
    --- End diff --
    
    Here also, I think it would be good to tell the user to set `hadoop.version` explicitly, to make it easier to generalize.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42175500
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42128534
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42158974
  
    Thanks @srowen! I tested this and it seems to work well (tests below).
    
    Comments are inline. My main thoughts were:
    1. It might be nicer to ask people to always set `hadoop.version`; it's just a bit more explicit and IMO less confusing. Of course, advanced users can rely on the profile to set the version.
    2. It could be nice to make distinct profiles for 2.3 and 2.4. Then it will be clear to users that if they want to build against 2.5+ they are in uncharted territory, since we can't officially support those builds until they come out.
    
    ```
    mvn -Pyarn -Phadoop-2.3 -DskipTests clean package
    ./bin/spark-shell
    scala> sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/rankings/").count
    res1: Long = 1200
    ```




[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42129265
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12266954
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +54,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
    +
    +    # Apache Hadoop 2.3.X and newer
    +    $ mvn -Pyarn -Phadoop-2.3 -DskipTests clean package
    +
    +    # Apache Hadoop 2.4.X as a custom version
    +    $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.4.0 -DskipTests clean package
    --- End diff --
    
    Should we just make a profile called `hadoop-2.4`, even if the dependencies don't change? It might make it simpler for users to reason about.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42616622
  
    So this means that the Java-serialized form of a Spark class has changed (or at least it may have; the serialVersionUID changed) between the versions of two Spark components that you are using. I doubt it is related to this PR per se; more likely it comes from having two versions of Spark in play: what you just compiled, and whatever else may be on your machines. I see you distributed the updated code to some machines, but maybe not all? Are you sure the running processes are all the same new version? At the least, this error suggests they aren't.
    
    If so, the fix (which is a separate issue) is to really run the same version of everything. I suppose it may be worth looking into whether the serialized form of this class really changed, and whether managing its serialVersionUID explicitly might resolve what is really not a problem, but that's less good than matching versions everywhere, which should ensure this can't happen.
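
    To illustrate the mechanism (a generic sketch, not Spark code; the class names `Derived` and `Pinned` are hypothetical): a serializable class without an explicit `serialVersionUID` gets one derived from its structure, while an explicit declaration pins the value.

    ```java
    import java.io.ObjectStreamClass;
    import java.io.Serializable;

    public class SerialUidDemo {
        // No explicit serialVersionUID: the JVM derives one from the class's structure
        // (name, fields, methods), so a changed class can end up with a different value.
        public static class Derived implements Serializable {
            int cores;
        }

        // Explicit serialVersionUID: the value stays fixed until someone edits it.
        public static class Pinned implements Serializable {
            private static final long serialVersionUID = 1L;
            int cores;
        }

        // Returns the serialVersionUID that Java serialization associates with this class.
        public static long uidOf(Class<?> cls) {
            return ObjectStreamClass.lookup(cls).getSerialVersionUID();
        }

        public static void main(String[] args) {
            System.out.println("Derived: " + uidOf(Derived.class));
            System.out.println("Pinned:  " + uidOf(Pinned.class));
        }
    }
    ```

    When the sender and the receiver load builds of a class whose values differ, deserialization fails with `InvalidClassException`, which is the error reported here. Pinning the value only helps if the serialized form is in fact still compatible.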



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by darose <gi...@git.apache.org>.
Github user darose commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42602116
  
    I tried building and running this code, but it failed with:
    
    14/05/08 20:09:11 ERROR Remoting: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = -6451051318873184044, local class serialVersionUID = 5837456792360$
    1411
    java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = -6451051318873184044, local class serialVersionUID = 5837456792360714$
    1
            at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
            at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
            at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
            at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
    ...
    
    * I grabbed the source zip from https://github.com/apache/spark/tree/73b0cbcc241cca3d318ff74340e80b02f884acbd
    * I built it with "SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt assembly" followed by "SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt package"
    * I grabbed shark 0.9.1 source and built it with "SHARK_HADOOP_VERSION=2.3.0-cdh5.0.0 ./sbt/sbt assembly" and "SHARK_HADOOP_VERSION=2.3.0-cdh5.0.0 ./sbt/sbt package"
    * I deployed both to a client, a master, and a worker machine
    
    Master and worker can talk to each other, worker can register with master.  I can launch a shark shell on the client, and do basic things like "show tables".  But when I try a simple SELECT query, I get the above error.
    
    Any idea what I might be doing wrong?




[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42213357
  
    @srowen looks good to me! Is there anything else you plan to add to this or is it ready to go?



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42129266
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14641/



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42173056
  
    Merged build triggered.



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12272908
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +55,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
    +
    +    # Apache Hadoop 2.3.X
    +    $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
    +
    +    # Apache Hadoop 2.4.X
    +    $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.4.0 -DskipTests clean package
    --- End diff --
    
    Should be `-Phadoop-2.4` 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42175502
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14661/



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12261160
  
    --- Diff: core/pom.xml ---
    @@ -38,12 +38,6 @@
         <dependency>
           <groupId>net.java.dev.jets3t</groupId>
           <artifactId>jets3t</artifactId>
    -      <exclusions>
    -        <exclusion>
    -          <groupId>commons-logging</groupId>
    -          <artifactId>commons-logging</artifactId>
    -        </exclusion>
    -      </exclusions>
    --- End diff --
    
    Why remove it?



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42173652
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/629#discussion_r12272356
  
    --- Diff: docs/building-with-maven.md ---
    @@ -42,22 +54,40 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions wit
         # Apache Hadoop 0.23.x
         $ mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
     
    -For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and set the "hadoop.version", "yarn.version" property. Note that Hadoop 0.23.X requires a special `-Phadoop-0.23` profile:
    +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the "yarn-alpha" or "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". The additional build profile required depends on the YARN version:
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>YARN version</th><th>Profile required</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr><td>0.23.x to 2.1.x</td><td>yarn-alpha</td></tr>
    +    <tr><td>2.2.x and later</td><td>yarn</td></tr>
    +  </tbody>
    +</table>
    +
    +Examples:
     
         # Apache Hadoop 2.0.5-alpha
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
     
    -    # Cloudera CDH 4.2.0 with MapReduce v2
    +    # Cloudera CDH 4.2.0
         $ mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
     
    -    # Apache Hadoop 2.2.X (e.g. 2.2.0 as below) and newer
    -    $ mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests clean package
    -
         # Apache Hadoop 0.23.x
    -    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -Dyarn.version=0.23.7 -DskipTests clean package
    +    $ mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
    +
    +    # Apache Hadoop 2.2.X
    +    $ mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
    --- End diff --
    
    Why doesn't this work for Hadoop 2.2.0? I can fix the example. It should be a matter of setting the version and profile, unless there is something deeper, in which case maybe the build needs to change in the hadoop-2.2 profile?
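
For illustration, the "version and profile" pairing can be sketched as a small shell helper. The mapping below is taken only from the table and examples in the quoted docs diffs in this thread; the function name `yarn_build_profiles` is hypothetical, and versions the diffs do not clearly cover are rejected rather than guessed. The script prints the Maven command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: map a Hadoop version to the YARN-related build
# profiles shown in the quoted docs/building-with-maven.md diff.
yarn_build_profiles() {
  case "$1" in
    0.23.*)      echo "-Pyarn-alpha -Phadoop-0.23" ;;
    2.0.*|2.1.*) echo "-Pyarn-alpha" ;;
    2.2.*)       echo "-Pyarn -Phadoop-2.2" ;;
    2.3.*)       echo "-Pyarn -Phadoop-2.3" ;;
    *)           echo "no profile mapping for Hadoop $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the resulting Maven command line for one version.
version="2.2.0"
echo "mvn $(yarn_build_profiles "$version") -Dhadoop.version=$version -DskipTests clean package"
```

Setting `-Dhadoop.version` explicitly, as here, overrides whatever default the chosen Hadoop profile supplies.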



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42644509
  
    `json4s` is a direct dependency of Spark core. It's not related to the changes here. It sounds like something is running but being pointed to an artifact that does not contain dependencies, and they are not otherwise on the classpath.
    
    I am not sure how you're modifying/running this, but I _suspect_ you continue to collide with existing 0.9.0 binaries and environment variables on your cluster. Building matching binaries is the easy part. It's deploying them manually that may be difficult. It should just be a matter of scraping out and replacing the stuff you find in `/opt/cloudera/.../spark`, but you may have to be sure to replace many things, not just one assembly jar. You'd have to restart the services too. And this has to happen on all workers.
    
    Longer-term of course an updated Spark will be deployed with CDH updates anyway.
    
    If you just want to move off Hive, there's another answer for that, but that's a different question.
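
One way to test whether an artifact "does not contain dependencies" is to look for the missing class in the jar's listing. The following is a minimal sketch, not a procedure from this thread: the helper names `class_entry` and `listing_has_class` and the `ASSEMBLY_JAR` variable are hypothetical, and the usage line assumes the JDK `jar` tool is on the PATH.

```shell
#!/bin/sh
# Turn a class name as printed in a NoClassDefFoundError (slash-separated,
# e.g. org/json4s/JsonAST$JValue) into the entry name used inside a jar.
class_entry() {
  printf '%s.class\n' "$1"
}

# Succeed if the jar listing read from stdin has an entry for class $1.
# -x matches the whole line, -F treats '$' and '.' literally, -q is silent.
listing_has_class() {
  grep -qxF -- "$(class_entry "$1")"
}

# Usage (ASSEMBLY_JAR is a placeholder for the jar your launcher points at):
#   jar tf "$ASSEMBLY_JAR" | listing_has_class 'org/json4s/JsonAST$JValue' \
#     && echo present || echo missing
```

If the class is missing from the jar being launched, it has to come from elsewhere on the classpath, which is the situation the comment above describes.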



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42673811
  
    Although there's now a 'fix' in, it is only going to do automatically what you are doing by hand.
    My hunch is that perhaps the script, or env variable, already contains a classpath with "0.9.0" jars in it, which means that when they're replaced with differently-named jars, suddenly it can't find basic classes. Just a guess.
    
    (Not sure what support will do with it, given that this is unsupported territory. You can ask. There's a reasonable question here about understanding the deployment. They'll reach out to the right people as needed.)
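
The hunch about a classpath that still lists "0.9.0" jars can be checked mechanically. This is only a sketch under stated assumptions: the classpath is a colon-separated string, the helper name `stale_classpath_entries` is hypothetical, and the sample paths are made up for the example rather than taken from any real deployment.

```shell
#!/bin/sh
# Hypothetical helper: print every entry of a colon-separated classpath ($1)
# that mentions a given version string ($2).
stale_classpath_entries() {
  classpath="$1"
  version="$2"
  # Split on ':' into one entry per line, then keep fixed-string matches.
  printf '%s\n' "$classpath" | tr ':' '\n' | grep -F -- "$version"
}

# Example with made-up paths; substitute the classpath your scripts or
# environment variables actually build.
stale_classpath_entries \
  "/tmp/a/spark-core_2.10-0.9.0.jar:/tmp/b/spark-assembly-1.0.0.jar" \
  "0.9.0"
```

Any line printed is an entry that would stop resolving once the old jars are replaced with differently named ones.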



[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...

Posted by darose <gi...@git.apache.org>.
Github user darose commented on the pull request:

    https://github.com/apache/spark/pull/629#issuecomment-42635850
  
    I tried downloading the latest from the master branch of both spark and shark, and built and ran them (in the hopes of getting past that "class incompatible" issue) but no luck there either.  Now it's dying on "Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue"
    
    I'm really at my wits' end here - I have spent days trying to get this to run.  Is there a matched set of spark & shark binaries available somewhere that can run on CDH5?  (I.e., one that includes this jets3t fix.)  Or if not, could someone provide instructions on how I might build them?
    
    I'm going to have to fall back to slow-as-molasses Hive if I can't find a way through this.

