Posted to user@spark.apache.org by Guillaume Pitel <gu...@exensa.com> on 2014/01/23 11:56:39 UTC

Advice if your workers die often

Hi sparkers,

So I had this problem where my workers were often dying or disappearing (and I
had to kill -9 their processes manually): sometimes during a computation,
sometimes when I Ctrl-C'd the driver, sometimes right at the end of an
application run.

These tuning changes seem to have solved the problem (set in spark-env.sh):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"

export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"

Explanation: I increased the timeouts because the master kept missing a 
heartbeat, removing the worker, and then complaining that an unknown worker 
was sending heartbeats. I also enabled the consolidateFiles option because 
deleting the shuffle files in /tmp/spark-local* was taking forever due to the 
many files my job created.
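
For what it's worth, the application-level settings above (spark.akka.timeout 
and spark.shuffle.consolidateFiles) can also be set programmatically in the 
driver before the SparkContext is created; spark.worker.timeout is read by the 
standalone master/worker daemons, so I believe it has to stay in 
SPARK_DAEMON_JAVA_OPTS. A rough sketch (Scala, system-property style; the 
master URL and app name are placeholders):

import org.apache.spark.SparkContext

// Set the application-level properties as JVM system properties before the
// SparkContext exists, so it picks them up.
System.setProperty("spark.akka.timeout", "200")               // seconds
System.setProperty("spark.shuffle.consolidateFiles", "true")  // fewer shuffle files on disk

val sc = new SparkContext("spark://master-host:7077", "my-app")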

I also added the following to all my programs right after creating the 
SparkContext (sc), so the application shuts down cleanly when a job is cancelled:

sys.addShutdownHook { sc.stop() }
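
For context, a minimal sketch of a driver with the hook registered right after 
the context is created (the app name, master URL and job body are placeholders, 
not taken from my actual programs):

import org.apache.spark.SparkContext

object CleanShutdownExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master-host:7077", "clean-shutdown-example")

    // Register the hook right away: it runs on JVM shutdown (e.g. Ctrl-C on the
    // driver) and stops the context so the workers release their executors.
    sys.addShutdownHook { sc.stop() }

    // Placeholder job body.
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")

    sc.stop()
  }
}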

Hope this can be useful to someone

Guillaume
-- 
eXenSa

*Guillaume PITEL, Président*
+33(0)6 25 48 86 80

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05


heterogeneous cluster - problems setting spark.executor.memory

Posted by Yadid Ayzenberg <ya...@media.mit.edu>.
Hi Community,

I'm running Spark in standalone mode, and in my current cluster each slave 
has 8 GB of RAM.
I wanted to add one more powerful machine with 100 GB of RAM as a slave 
to the cluster and ran into some difficulty.
If I don't set spark.executor.memory, all slaves allocate only 
512 MB of RAM to the job.
However, I can't set spark.executor.memory to more than 8 GB, otherwise 
my existing slaves will not be used.
It seems Spark was designed mainly for a homogeneous cluster. Can anyone 
suggest a way around this?
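
In other words (a rough sketch in Scala, standalone mode, system-property 
style; the master URL and app name are placeholders, and the values are just 
the ones above):

import org.apache.spark.SparkContext

// spark.executor.memory is one value for every executor of the application, so
// it has to fit on the smallest worker; asking for 100g here would leave the
// 8 GB slaves without executors.
System.setProperty("spark.executor.memory", "8g")

val sc = new SparkContext("spark://master-host:7077", "hetero-cluster-example")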

Thanks,

Yadid


Re: Advice if your workers die often

Posted by Debasish Das <de...@gmail.com>.
I have also seen that when one of the users of the cluster writes some buggy
code, the workers die... any idea whether these fixes will also help in that
scenario?

If you write a buggy YARN app and the code fails on the cluster, the JVMs don't
die...

Re: Advice if your workers die often

Posted by Sam Bessalah <sa...@gmail.com>.
Definitely, thanks. I usually just played around with timeouts before, but this
helps.



Re: Advice if your workers die often

Posted by Manoj Samel <ma...@gmail.com>.
Thanks! This is useful.

