Posted to dev@spark.apache.org by Jey Kottalam <je...@cs.berkeley.edu> on 2013/08/21 05:39:02 UTC

Important: Changes to Spark's build system on master branch

The master branch of Spark has been updated with PR #838, which
changes aspects of Spark's interface to Hadoop. This also involved
changes to Spark's build system, as documented below. The
documentation will be updated with this information shortly.

Please feel free to reply to this thread with any questions or if you
encounter any problems.

-Jey



When Building Spark
===================

- General: The default version of Hadoop has been updated to 1.2.1 from 1.0.4.

- General: You will probably need to perform an "sbt clean" or "mvn
clean" to remove old build files. SBT users may also need to perform a
"clean" when changing Hadoop versions (or at least delete the
lib_managed directory).

- SBT users: The version of Hadoop used can be specified by setting
the SPARK_HADOOP_VERSION environment variable when invoking sbt, and
YARN-enabled builds can be created by setting SPARK_WITH_YARN=true.
Example:

    # Using Hadoop 1.1.0 (a version of Hadoop without YARN)
    SPARK_HADOOP_VERSION=1.1.0 ./sbt/sbt package assembly

    # Using Hadoop 2.0.5-alpha (which is a YARN-based version of Hadoop)
    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true ./sbt/sbt package assembly

- Maven users: Set the Hadoop version to build against by editing the
"pom.xml" file in the root directory and changing the "hadoop.version"
property (and the "yarn.version" property, if applicable). If you are
building with YARN disabled, you no longer need to enable any Maven
profiles (i.e., "-P" flags). To build with YARN enabled, use the
"hadoop2-yarn" Maven profile.

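For example, a YARN-enabled build could be invoked as follows (the
"clean package" goals here are illustrative; use whatever Maven goals
you normally would):

    mvn -Phadoop2-yarn clean package
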
- The "make-distribution.sh" script has been updated to take
additional parameters to select the Hadoop version and enable YARN.
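For example, a sketch of the new invocation (this assumes the added
flags are named "--hadoop" and "--with-yarn"; check the script's usage
header to confirm):

    # Build a binary distribution against Hadoop 2.0.5-alpha with YARN support
    ./make-distribution.sh --hadoop 2.0.5-alpha --with-yarn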



When Writing Spark Applications
===============================


- Non-YARN users: If you wish to use HDFS, you will need to add the
appropriate version of the "hadoop-client" artifact from the
"org.apache.hadoop" group to your project.

    SBT example:
        // "force()" is required because "1.1.0" is less than Spark's
        // default of "1.2.1"
        "org.apache.hadoop" % "hadoop-client" % "1.1.0" force()

    Maven example:
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <!-- the brackets are needed to tell Maven that this is a
               hard dependency on version "1.1.0" exactly -->
          <version>[1.1.0]</version>
        </dependency>

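For context, here is a minimal sketch of a complete SBT build file
using the dependency above (the project name and the Spark artifact
coordinates are placeholders; substitute whatever your project
actually depends on):

    // build.sbt (older sbt requires blank lines between settings)
    name := "my-spark-app"  // hypothetical project name

    scalaVersion := "2.9.3"

    libraryDependencies ++= Seq(
      // Placeholder coordinates for the Spark dependency itself
      "org.spark-project" % "spark-core_2.9.3" % "0.8.0-SNAPSHOT",
      // Same "force()" pin as shown above: holds Hadoop at 1.1.0
      // even though Spark's default is 1.2.1
      ("org.apache.hadoop" % "hadoop-client" % "1.1.0").force()
    )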

- YARN users: You will now need to set SPARK_JAR to point to the
spark-yarn assembly instead of the spark-core assembly previously
used.

  SBT Example:
       SPARK_JAR=$PWD/yarn/target/spark-yarn-assembly-0.8.0-SNAPSHOT.jar \
        ./run spark.deploy.yarn.Client \
          --jar $PWD/examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
          --class spark.examples.SparkPi --args yarn-standalone \
          --num-workers 3 --worker-memory 2g --master-memory 2g --worker-cores 1

Re: Important: Changes to Spark's build system on master branch

Posted by Henry Saputra <he...@gmail.com>.
Ah cool, thanks Jey

- Henry



Re: Important: Changes to Spark's build system on master branch

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
Hi Henry,

Yes, that is accurate to my knowledge. These changes were merged to
the Github-hosted Spark repository at http://github.com/mesos/spark
today (Aug 20) as part of pull request #838
(https://github.com/mesos/spark/pull/838). The Apache-hosted
repository at https://git-wip-us.apache.org/repos/asf/incubator-spark.git
does not appear to have these changes.

-Jey


Re: Important: Changes to Spark's build system on master branch

Posted by Henry Saputra <he...@gmail.com>.
Hi Jey, just want to clarify: the changes happened on the master branch
of the GitHub repository and not in the Apache git repository?

Thanks,

Henry



Re: Important: Changes to Spark's build system on master branch

Posted by Konstantin Boudnik <co...@apache.org>.
Looked into the code - totally makes sense now. Sorry for the fuss - I was
emailing from my phone, away from a normal computer and access to the
code.

Very good change indeed, thanks Jey!
  Cos


Re: Important: Changes to Spark's build system on master branch

Posted by Matei Zaharia <ma...@gmail.com>.
I understand this Cos, but Jey's patch actually removes the idea of "hadoop2". You only set SPARK_HADOOP_VERSION (which can be 1.0.x, 2.0.0-cdh4, 2.0.5-alpha, etc) and possibly SPARK_YARN_MODE if you want to run on YARN.

Matei


Re: Important: Changes to Spark's build system on master branch

Posted by Konstantin Boudnik <co...@apache.org>.
I hear you guys - and I am well aware of the differences between the two.
However, actual Hadoop2 doesn't even have such a thing as MR1 - this is why
the profile naming is misleading. What you see under the current profile
'hadoop2' is essentially a commercial hack that doesn't exist anywhere
beyond CDH artifacts (and even there, not for long).

Besides, YARN != MR2 :) YARN is a resource manager that, among other things,
provides for running MR applications on it.

We can argue about semantics till we're blue in the face, but the reality is
simple: the current 'hadoop2' profile doesn't reflect Hadoop2 facts. That's
my only point.

Cos


Re: Important: Changes to Spark's build system on master branch

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
As Mridul points out, the old "hadoop1" and "hadoop2" terminology
referred to the versions of certain interfaces and classes within
Hadoop. With these latest changes we have unified the handling of both
hadoop1 and hadoop2 interfaces so that the build is agnostic to the
exact Hadoop version available at runtime.

However, the distinction between YARN-enabled and non-YARN builds does
still exist. I propose that we retroactively reinterpret
"hadoop2-yarn" as shorthand for "Hadoop MapReduce v2 (aka YARN)".

-Jey


Re: Important: Changes to Spark's build system on master branch

Posted by Mridul Muralidharan <mr...@gmail.com>.
hadoop2, in this context, means using Spark on a Hadoop cluster without
YARN but with hadoop2 interfaces.
hadoop2-yarn uses the YARN RM to launch a Spark job (and obviously uses
hadoop2 interfaces).

Regards,
Mridul


Re: Important: Changes to Spark's build system on master branch

Posted by Konstantin Boudnik <co...@apache.org>.
For what it's worth, guys - the hadoop2 profile's content is misleading: CDH
isn't Hadoop2; it has 1354 patches on top of Hadoop2 alpha.

What is called hadoop2-yarn is actually hadoop2. Perhaps, while we are at it,
the profiles need to be renamed. I can supply the patch if the community is
ok with it.

Cos
 

Re: Important: Changes to Spark's build system on master branch

Posted by Andy Konwinski <an...@gmail.com>.
Hey Jey,

I'd just like to add that you can also build against hadoop2 without
modifying the pom.xml file by passing the hadoop.version property on the
command line, like this:

mvn -Dhadoop.version=2.0.0-mr1-cdh4.1.2 clean verify

Also, when you mentioned building with Maven in your instructions, I think
you forgot to finish writing out your example for activating the yarn
profile, which I think would be something like:

mvn -Phadoop2-yarn clean verify

...right?

BTW, I've set up the AMPLab Jenkins Spark Maven Hadoop2 project to build
using the new options:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-Hadoop2/

Andy
