Posted to user@spark.apache.org by "Williams, Ken" <Ke...@windlogics.com> on 2014/04/25 21:53:45 UTC

Build times for Spark

I've cloned the GitHub repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM), and it takes a pretty long time.

For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time).  After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU.

Is that typical?  Or does that indicate some setup problem in my environment?
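
For reference, here are the two invocations in runnable form, wrapped in the shell's time builtin, which is one way to get wallclock and CPU numbers like those above:

  time sbt/sbt assembly
  time SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly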

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com


Re: Build times for Spark

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
AFAIK the resolver does pick things up from your local ~/.m2 -- note that
since ~/.m2 is on NFS, that adds to the filesystem traffic.

Shivaram
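
For anyone who wants to force resolution from the local repository, a minimal sketch (assumes sbt 0.13-era syntax; whether the Spark build of this vintage already picks up the setting is worth verifying):

  # Drop a one-line .sbt file into the project root so Ivy also checks
  # the local ~/.m2 repository; the file name is arbitrary, and
  # Resolver.mavenLocal is standard sbt.
  echo 'resolvers += Resolver.mavenLocal' > use-maven-local.sbt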


RE: Build times for Spark

Posted by "Williams, Ken" <Ke...@windlogics.com>.
I am indeed, but it's a pretty fast NFS.  I don't have any SSD I can use, but I could try to use local disk to see what happens.
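
A minimal version of that local-disk experiment (the /tmp path is just an example):

  # Clone the existing checkout onto local disk and time the same build.
  git clone . /tmp/spark-local && cd /tmp/spark-local
  time sbt/sbt assembly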

For me, a large portion of the time seems to be spent on lines like "Resolving org.fusesource.jansi#jansi;1.4 ..." or similar.  Is this going out to find Maven resources?  Any way to tell it to just use my local ~/.m2 repository instead when the resource already exists there?  Sometimes I even get sporadic errors like this:

  [info] Resolving org.apache.hadoop#hadoop-yarn;2.2.0 ...
  [error] SERVER ERROR: Bad Gateway url=http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar


-Ken


Re: Build times for Spark

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Are you by any chance building this on NFS? As far as I know, the build is
severely bottlenecked by filesystem calls during assembly (each class file
in each dependency gets an fstat call or something like that). That is
partly why building from, say, a local ext4 filesystem or an SSD is much
faster irrespective of memory / CPU.

Thanks
Shivaram
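
One way to watch that overhead directly is to count the filesystem syscalls the build makes; a Linux-only sketch (strace slows the build, so treat the numbers as relative):

  # Summarize the stat/open-family syscalls made by sbt and its children.
  strace -f -c -e trace=stat,fstat,open,openat sbt/sbt assembly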


Re: Build times for Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can always increase the memory available to sbt by setting

export JAVA_OPTS="-Xmx10g"
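
Whether the sbt/sbt wrapper script reads JAVA_OPTS varies between versions; SBT_OPTS is another variable the stock sbt launcher honors, so exporting both before the build is a safe sketch:

  export JAVA_OPTS="-Xmx10g"
  export SBT_OPTS="-Xmx10g"
  sbt/sbt assembly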





Thanks
Best Regards


RE: Build times for Spark

Posted by "Williams, Ken" <Ke...@windlogics.com>.
No, I haven’t done any config for SBT.  Is there somewhere you could point me for how to do that?

-Ken


Re: Build times for Spark

Posted by Josh Rosen <ro...@gmail.com>.
Did you configure SBT to use the extra memory?


Re: Build times for Spark

Posted by DB Tsai <db...@stanford.edu>.
Are you using an SSD? We found that the bottleneck is not computational but
disk IO. During assembly, sbt moves lots of class files and jars, packaging
them into a single flat jar. I can do an assembly on my MacBook in 10
minutes, while before upgrading to an SSD it took 30-40 minutes.
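
To get a feel for the volume involved, you can count the entries in the finished assembly jar (the path is approximate for builds of this era):

  # Every dependency's class files end up in this one flat jar.
  unzip -l assembly/target/scala-2.10/spark-assembly-*.jar | tail -n 1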


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

