Posted to user@spark.apache.org by Punit Naik <na...@gmail.com> on 2016/10/04 00:07:22 UTC

Executor Lost error

Hi All

I am trying to run a program on a large dataset (~1 TB). I have already
tested the code on a small amount of data and it works fine. But I noticed
that the job fails when the input is large. It was giving me errors about
Akka actor disassociation, which I fixed by increasing the timeouts.
But now I am getting "executor lost" and "executor lost failure" errors
which I can't figure out. These are my current configs:

--conf "spark.network.timeout=30000"
--conf "spark.core.connection.ack.wait.timeout=30000"
--conf "spark.akka.timeout=30000"
--conf "spark.akka.askTimeout=30000"
--conf "spark.akka.frameSize=1000"
--conf "spark.storage.blockManagerSlaveTimeoutMs=600000"
--conf "spark.network.timeout=600"
--conf "spark.shuffle.memoryFraction=0.8"
--conf "spark.driver.maxResultSize=16g"
--conf "spark.driver.cores=10"
--conf "spark.driver.memory=10g"

Can anyone tell me any more configs to circumvent this "executor lost" and
"executor lost failure" error?

-- 
Thank You

Regards

Punit Naik

Re: Executor Lost error

Posted by Aditya <ad...@augmentiq.co.in>.
Got any solution for this?






Re: Executor Lost error

Posted by Nirav Patel <np...@xactlycorp.com>.
A few pointers in addition:

1) Executors can also get lost if they hang in GC and can't respond to the
driver within the timeout. That should show up in the executor logs, though.
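For example, you can surface GC pauses in the executor logs with something
like this (a minimal sketch, assuming a JDK 8 runtime where these HotSpot
flags apply):

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Long full-GC pauses right before the disassociation messages would confirm
the executors are struggling on heap.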
2) --conf "spark.shuffle.memoryFraction=0.8" is a very high shuffle
fraction. You should check the Event Timeline in the UI and the executor
logs to see whether it is failing on shuffle read, during computation, or
on shuffle write.
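If the shuffle fraction turns out to be the problem, a rough sketch closer
to the 1.5 defaults (under the pre-1.6 legacy memory manager, shuffle,
storage and user memory all share the executor heap):

--conf "spark.shuffle.memoryFraction=0.2"
--conf "spark.storage.memoryFraction=0.6"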

We give 80 GB of RAM to some of our Spark workloads (< 2B records). We use
Spark 1.5. You can try Spark 2.0 with Datasets if you haven't already.
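If you do try 2.0, a minimal Dataset version of a job looks like this (the
input/output paths and the Record case class are made up for illustration):

import org.apache.spark.sql.SparkSession

case class Record(id: Long, value: String)

object DatasetJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-job").getOrCreate()
    import spark.implicits._
    // Encoders keep rows in Tungsten binary format, which puts far less
    // pressure on the GC than boxed objects in plain RDDs
    val ds = spark.read.parquet("/path/to/input").as[Record]
    val counts = ds.groupByKey(_.value).count()
    counts.write.parquet("/path/to/output")
    spark.stop()
  }
}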





Re: Executor Lost error

Posted by Yong Zhang <ja...@hotmail.com>.
You should check your executor log to identify the reason. My guess is that the executor died due to OOM.


If that is the reason, then you need to tune your executor memory settings, or more importantly your partition count, to make sure you have enough memory to handle the data in each partition.
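As a rough sketch (the numbers are placeholders you would size to your
cluster, and memoryOverhead assumes you run on YARN):

--executor-memory 8g
--conf "spark.executor.cores=4"
--conf "spark.yarn.executor.memoryOverhead=2048"

and in the job itself, raise the partition count so each task works on a
smaller slice, e.g.

// inputRdd stands for whatever RDD you build from the ~1TB input;
// 4000 partitions puts each task at roughly 250MB
val repartitioned = inputRdd.repartition(4000)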


Yong

