Posted to user@spark.apache.org by Chen Song <ch...@gmail.com> on 2014/09/22 18:22:10 UTC

spark time out

I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
following exception.

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec
        at
org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
        at
org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
That relieved the situation somewhat, but not by much. I am fairly
confident it is not due to GC on the executors. What could be the reason
for this?
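For anyone reproducing the change: the timeout can be raised in the driver's SparkConf (or equivalently with --conf on spark-submit). A minimal sketch of the 120-second setting mentioned above; the app name is made up:

```scala
import org.apache.spark.SparkConf

// Sketch: raise the shuffle ack timeout from the 60 s default to 120 s.
// In Spark 1.1.x the value is interpreted as seconds.
// The app name below is hypothetical.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.core.connection.ack.wait.timeout", "120")
```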

Chen

Re: spark time out

Posted by Andrew Ash <an...@andrewash.com>.
Hi Chen,

The fetch failures seem to be happening a lot more often on 1.1.0 --
there's a bug tracking them at
https://issues.apache.org/jira/browse/SPARK-3633 that might be the same as
what you're seeing. Can you take a look at that bug and, if it matches
what you're observing, follow it and vote for it?

There currently seem to be three things causing FetchFailures in 1.1:

1) long GCs on an executor (longer than
spark.core.connection.ack.wait.timeout, default 60 sec)
2) too many open files (hitting the kernel file descriptor limit, ulimit -n)
3) some undetermined issue being tracked on that ticket
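Cause 2 can be checked from inside a running JVM as well as with ulimit -n. A sketch, assuming a unix JVM that exposes the com.sun management bean:

```scala
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

object FdCheck {
  def main(args: Array[String]): Unit = {
    ManagementFactory.getOperatingSystemMXBean match {
      case os: UnixOperatingSystemMXBean =>
        // Open vs. maximum file descriptors for this JVM process;
        // a shuffle with many map x reduce partitions can exhaust the max.
        println(s"open fds: ${os.getOpenFileDescriptorCount}, max: ${os.getMaxFileDescriptorCount}")
      case _ =>
        println("not a unix JVM; check ulimit -n from a shell instead")
    }
  }
}
```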

Hope that helps!
Andrew

On Tue, Sep 23, 2014 at 11:14 AM, Chen Song <ch...@gmail.com> wrote:


Re: spark time out

Posted by Chen Song <ch...@gmail.com>.
I am running the job on 500 executors, each with 8G of memory and 1 core.

I see lots of fetch failures in the reduce stage when running a simple
reduceByKey.

map tasks -> 4000
reduce tasks -> 200
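(For readers unfamiliar with the operation: reduceByKey merges all values that share a key using the supplied function, which is what the reduce stage above shuffles data to do. The helper below is a hypothetical local stand-in, not Spark's implementation, and needs no cluster:)

```scala
object ReduceByKeyDemo {
  // Local stand-in for RDD.reduceByKey: group pairs by key, then fold
  // each key's values with the combine function.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))
    println(reduceByKeyLocal(pairs)(_ + _))
  }
}
```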






-- 
Chen Song