Posted to user@spark.apache.org by Teng Qiu <te...@gmail.com> on 2016/05/02 01:54:55 UTC

Re: Spark on AWS

Hi, we have made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133

You can deploy our Spark package using our Docker image:

docker run -d --net=host \
           -e START_MASTER="true" \
           -e START_WORKER="true" \
           -e START_WEBAPP="true" \
           -e START_NOTEBOOK="true" \
           registry.opensource.zalan.do/bi/spark:1.6.2-6


A Jupyter notebook will be running on port 8888.


have fun

Best,

Teng

2016-04-29 12:37 GMT+02:00 Steve Loughran <st...@hortonworks.com>:
>
> On 28 Apr 2016, at 22:59, Alexander Pivovarov <ap...@gmail.com> wrote:
>
> > Spark works well with S3 (read and write). However, it's recommended
> > to set spark.speculation true (it's expected that some tasks fail if
> > you read a large S3 folder, so speculation should help)
>
>
>
> I must disagree.
>
> Speculative execution has >1 executor running the query, with whoever
> finishes first winning.
> However, "finishes first" is implemented in the output committer, by
> renaming the attempt's output directory to the final output directory:
> whoever renames first wins.
> This relies on rename() being implemented in the filesystem client as an
> atomic transaction.
> Unfortunately, S3 doesn't do renames. Instead, every file gets copied to
> one with the new name, then the old file is deleted; an operation that
> takes time O(data * files).
>
> If you have more than one executor trying to commit the work
> simultaneously, your output will be a mess of both executions, without
> anything detecting and reporting it.
>
> Where did you find this recommendation to set speculation=true?
>
> -Steve
>
> see also: https://issues.apache.org/jira/browse/SPARK-10063
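
A minimal sketch of the race Steve describes, using a toy in-memory
"object store" (a plain dict) rather than real S3 or the real Hadoop
output committer. The key names, the helper function and the interleaving
schedule are all made up for illustration; the only point is that a
"rename" done as per-file copy + delete can interleave with another
attempt's commit.

```python
# Toy model: an object store is a flat dict of key -> bytes, with no
# real directories, so "renaming a directory" means touching every key.
def copy_then_delete_steps(store, src_prefix, dst_prefix):
    """Non-atomic 'rename': copy each file to its new name, then delete
    the old one. Yields after every file so another commit can run in
    between, which an atomic rename would not allow."""
    for key in sorted(k for k in store if k.startswith(src_prefix)):
        store[dst_prefix + key[len(src_prefix):]] = store[key]  # copy
        del store[key]                                          # delete
        yield key

# Two speculative attempts wrote the same file names (hypothetical layout).
store = {
    "out/_temporary/attempt_0/part-00000": b"attempt_0",
    "out/_temporary/attempt_0/part-00001": b"attempt_0",
    "out/_temporary/attempt_1/part-00000": b"attempt_1",
    "out/_temporary/attempt_1/part-00001": b"attempt_1",
}

commits = [
    copy_then_delete_steps(store, "out/_temporary/attempt_0/", "out/"),
    copy_then_delete_steps(store, "out/_temporary/attempt_1/", "out/"),
]

# One possible interleaving: attempt 0 moves one file, attempt 1 moves
# both of its files, then attempt 0 moves its second file.
for attempt in [0, 1, 1, 0]:
    next(commits[attempt], None)

final = {k: v for k, v in store.items() if k.startswith("out/part-")}
for key in sorted(final):
    print(key, final[key])
```

With this schedule the final directory holds part-00000 from one attempt
and part-00001 from the other, and nothing in the store records that it
happened. Note that spark.speculation defaults to false, so leaving it
unset already avoids running duplicate attempts.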

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Spark on AWS

Posted by Gourav Sengupta <go...@gmail.com>.

Hi,

I agree with Steve; just start with vanilla Spark on EMR.

See point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

Note that with dynamic allocation of executors, jobs take a bit of time
to start running. You can therefore also set maximizeResourceAllocation
when starting an EMR cluster, so that executors are allocated the maximum
possible resources from the start, as described here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html

If you are loading a lot of data onto the Spark master node for graphing
or exploratory analysis with Matlab, seaborn or bokeh, it's better to
increase the driver memory by recreating the Spark context.
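
As a rough sketch, the two settings above could be written as an EMR
configurations list like the one below. The "spark" classification with
maximizeResourceAllocation is the one described in the EMR release guide
linked above; the "spark-defaults" entry for spark.driver.memory and the
4g value are assumptions for illustration, and how the two settings
interact on a given EMR release is not something this sketch verifies.

```python
import json

# EMR configurations list. The 4g driver memory is an arbitrary example
# value, not a recommendation from this thread.
emr_configurations = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    },
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.driver.memory": "4g"},
    },
]

config_json = json.dumps(emr_configurations, indent=2)
print(config_json)
```

The printed JSON can be saved to a file (say spark-config.json, a
hypothetical name) and passed when creating the cluster with
aws emr create-cluster --configurations file://spark-config.json.
The Spark documentation also notes that in client mode
spark.driver.memory has to be set before the driver JVM starts, so
setting it in the defaults (or via --driver-memory) may be more reliable
than setting it on a new context from inside a running notebook.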


Regards
Gourav Sengupta


