Posted to user@spark.apache.org by Chen Song <ch...@gmail.com> on 2014/09/22 18:22:10 UTC
spark time out
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
following exception.
java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
        at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
        at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
That relieved the situation somewhat, but not by much. I am fairly
confident the failures are not due to GC pauses on the executors. What
else could be causing this?
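For reference, here is one way the timeout can be raised at submit time (a
sketch only; the application jar name and the chosen value are
illustrative, not from this thread):

```shell
# Raise the ack timeout beyond the 60-second default (value is in seconds).
# "your-app.jar" is a placeholder for the actual application artifact.
spark-submit \
  --conf spark.core.connection.ack.wait.timeout=120 \
  your-app.jar
```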
Chen
Re: spark time out
Posted by Andrew Ash <an...@andrewash.com>.
Hi Chen,
The fetch failures seem to be happening a lot more to people on 1.1.0 --
there is a bug tracking fetch failures at
https://issues.apache.org/jira/browse/SPARK-3633 that might be the same as
what you're seeing. Can you take a peek at that bug and, if it matches what
you're observing, follow it and vote for it?
There currently seem to be 3 things causing FetchFailures in 1.1:
1) long GCs on an executor (longer than
spark.core.connection.ack.wait.timeout default 60sec)
2) too many files open (hit kernel limits on ulimit -n)
3) some undetermined issue being tracked on that ticket
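The first two causes can be checked directly on the executor hosts. A
sketch of the diagnostics (the limit value and GC flags are illustrative
assumptions, and "your-app.jar" is a placeholder):

```shell
# Cause 2: check the per-process open-file limit on each executor host.
ulimit -n
# If it is low, raise it for the current shell, e.g.:
ulimit -n 65536

# Cause 1: enable GC logging on executors so long pauses show up in
# executor stderr (flags valid for the JVMs of the Spark 1.1 era).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your-app.jar
```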
Hope that helps!
Andrew
On Tue, Sep 23, 2014 at 11:14 AM, Chen Song <ch...@gmail.com> wrote:
> I am running the job on 500 executors, each with 8G and 1 core.
>
> See lots of fetch failures on reduce stage, when running a simple
> reduceByKey
>
> map tasks -> 4000
> reduce tasks -> 200
>
Re: spark time out
Posted by Chen Song <ch...@gmail.com>.
I am running the job on 500 executors, each with 8 GB of memory and 1 core.
I see lots of fetch failures in the reduce stage when running a simple
reduceByKey.
map tasks -> 4000
reduce tasks -> 200
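With 4000 map tasks and only 200 reduce tasks, each reducer fetches shuffle
blocks from up to 4000 map outputs. One thing worth trying (a sketch, not a
confirmed fix for this thread; input/output paths and the partition count
are hypothetical) is raising the reduce-side parallelism via the
numPartitions argument of reduceByKey:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-parallelism"))
    val pairs = sc.textFile("hdfs:///input") // hypothetical input path
      .map(line => (line, 1))
    // Pass an explicit partition count so each reduce task handles a
    // smaller slice of the shuffle (1000 here instead of the 200 above).
    val counts = pairs.reduceByKey(_ + _, numPartitions = 1000)
    counts.saveAsTextFile("hdfs:///output") // hypothetical output path
    sc.stop()
  }
}
```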
--
Chen Song