Posted to dev@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2015/11/24 05:18:05 UTC

Fastest way to build Spark from scratch

Say I want to build a complete Spark distribution against Hadoop 2.6+ as
fast as possible from scratch.

This is what I’m doing at the moment:

./make-distribution.sh -T 1C -Phadoop-2.6

-T 1C instructs Maven to spin up 1 thread per available core. This takes
around 20 minutes on an m3.large instance.
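
For reference, this is roughly the Maven invocation that make-distribution.sh
boils down to (a sketch; the script's exact extra flags and packaging steps
are assumptions here):

# Approximate underlying build; make-distribution.sh adds the packaging on top:
build/mvn -T 1C -Phadoop-2.6 -DskipTests clean package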

I see that spark-ec2, on the other hand, builds Spark as follows
<https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
when you deploy Spark at a specific git commit:

sbt/sbt clean assembly
sbt/sbt publish-local

This seems slower than using make-distribution.sh, actually.

Is there a faster way to do this?

Nick

Re: Fastest way to build Spark from scratch

Posted by Josh Rosen <jo...@databricks.com>.
Yeah, this is the same idea behind having Travis cache the ivy2 folder to
speed up builds. In AMPLab Jenkins, each individual build workspace has its
own Ivy cache, which is preserved across build runs but used by only one
active run at a time in order to avoid SBT Ivy lock contention (this
shouldn't be an issue in most environments, though).
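
A minimal sketch of that per-workspace setup, assuming the sbt/sbt wrapper
forwards -D flags to the JVM and that your CI exports a WORKSPACE directory:

# Give each workspace its own Ivy cache so concurrent builds don't contend
# for the same Ivy lock (sbt.ivy.home relocates sbt's Ivy home):
sbt/sbt -Dsbt.ivy.home="$WORKSPACE/.ivy2" assembly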

Re: Fastest way to build Spark from scratch

Posted by Nicholas Chammas <ni...@gmail.com>.
Interesting. As long as Spark's dependencies don't change that often, the
same caches could save "from scratch" build time over many months of Spark
development. Is that right?


Re: Fastest way to build Spark from scratch

Posted by Stephen Boesch <ja...@gmail.com>.
I will echo Steve L's comment about having zinc running (with --nailed).
That provides at least a 2x speedup; sometimes without it Spark simply
does not build for me.
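
A rough sketch of that workflow, assuming a standalone zinc 0.3.x install on
the PATH (flag spellings vary between zinc versions):

# Start a long-lived zinc compile server ("nailed" mode), build, then shut it down:
zinc -start
./make-distribution.sh -T 1C -Phadoop-2.6
zinc -shutdown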


Re: Fastest way to build Spark from scratch

Posted by Josh Rosen <jo...@databricks.com>.
@Nick, on a fresh EC2 instance a significant chunk of the initial build
time might be due to artifact resolution + downloading. Putting
pre-populated Ivy and Maven caches onto your EC2 machine could shave a
decent chunk of time off that first build.
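
A minimal sketch of that pre-population, with the host and paths as placeholders:

# Copy warm caches from a machine that has already built Spark onto the fresh
# instance before the first build, or bake these directories into your AMI:
rsync -az ~/.ivy2/cache/    ec2-user@NEW_INSTANCE:~/.ivy2/cache/
rsync -az ~/.m2/repository/ ec2-user@NEW_INSTANCE:~/.m2/repository/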


Re: Fastest way to build Spark from scratch

Posted by Nicholas Chammas <ni...@gmail.com>.
Thanks for the tips, Jakob and Steve.

It looks like my original approach is the best for me since I'm installing
Spark on newly launched EC2 instances and can't take advantage of
incremental compilation.

Nick


Re: Fastest way to build Spark from scratch

Posted by Steve Loughran <st...@hortonworks.com>.
1. You can use zinc (where possible) to speed up Scala compilations.
2. You might also consider setting up a local Jenkins VM, hooked to whatever
git repo and branch you are working off, and have it do the builds and tests
for you. Not so great for interactive dev, though.

Finally, on the Mac, the "say" command is pretty handy for letting you know
when some work in a terminal is done, so you can kick off the
first-thing-in-the-morning build of the SNAPSHOTs:

mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo

After that you can work on the modules you care about (via the -pl option).
That doesn't work if you are running on an EC2 instance, though.
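
For example, a sketch of a follow-up build of just one module (the module
name here is illustrative):

# Rebuild only the module you're working on, plus whatever it depends on (-am),
# against the SNAPSHOT artifacts installed by the morning build:
mvn install -pl sql/core -am -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say done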




Re: Fastest way to build Spark from scratch

Posted by Jakob Odersky <jo...@gmail.com>.
make-distribution and the second code snippet both create a distribution
from a clean state. They therefore require that every source file be
compiled and that takes time (you can maybe tweak some settings or use a
newer compiler to gain some speed).

I'm inferring from your question that for your use case deployment speed is
a critical issue, and furthermore that you'd like to build Spark for lots of
(every?) commits in a systematic way. In that case I would suggest you try
using the second code snippet without the `clean` task, and only resort to
it if the build fails.

On my local machine, an assembly without a clean drops from 6 minutes to 2.

regards,
--Jakob
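
Concretely, a sketch of the same commands with the clean dropped:

sbt/sbt assembly        # incremental: only recompiles what changed since the last run
sbt/sbt publish-local   # if you also need the artifacts published locally
# fall back to `sbt/sbt clean assembly` if the incremental build misbehaves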


