Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/04/05 02:52:36 UTC

Heartbeat exceeds

Hi,

In my ALS runs I am noticing messages that complain about heart beats:

14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
exceeds 45000ms
14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
exceeds 45000ms
14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
exceeds 45000ms

Is this an issue with the underlying JVM that Akka runs on? Can I
increase the heartbeat timeout somehow to resolve these messages?

Any more insight into the possible cause of the missed heartbeats would
be helpful...

I tried to re-run the job but it ultimately failed...

Also I am noticing negative numbers in the stage duration:




Any insights into the problem will be very helpful...

Thanks.
Deb

Re: Heartbeat exceeds

Posted by Debasish Das <de...@gmail.com>.
@patrick I think there is a bug...when this timeout happens, I suddenly
see some negative ms numbers in the Spark UI...I tried to send a picture
showing the negative ms numbers but it was rejected by the mailing
list...I will send it to your gmail...

From the archive I saw some more suggestions:

>>

It seems that these tunings have solved the problem (in spark-env.sh):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600
-Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"

export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600
-Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"

>>

For more understanding, it would be great if you could explain how these
two settings, -Dspark.worker.timeout=600 and -Dspark.akka.timeout=200,
differ from spark.storage.blockManagerSlaveTimeoutMs.

Also, what's the difference between the worker timeout and the akka
timeout?

Thanks.
Deb


Re: Heartbeat exceeds

Posted by Patrick Wendell <pw...@gmail.com>.
If you look in the Spark UI, do you see any garbage collection happening?
My best guess is that some of the executors are going into GC and they are
timing out. You can manually increase the timeout by setting the Spark conf:

spark.storage.blockManagerSlaveTimeoutMs

to a higher value. In your case it is currently set to 45000 ms, i.e. 45 seconds.
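
For reference, a minimal sketch of setting this programmatically through
SparkConf instead of spark-env.sh (the app name and the 300000 ms value
below are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the block manager heartbeat timeout from the 45000 ms default
// before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("ALSExample")  // placeholder app name
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
val sc = new SparkContext(conf)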





Re: Heartbeat exceeds

Posted by Andrew Or <an...@databricks.com>.
Setting spark.worker.timeout should not help you. What this value means is
that the master checks every 60 seconds whether the workers are still
alive, as the documentation describes. But this value also determines how
often the workers send HEARTBEAT messages to notify the master of their
liveness; in particular, under the default configuration, the workers send
such messages every 60 / 4 = 15 seconds. Increasing this value only means
it takes longer (i.e. 600 seconds in your case) to detect that something
went wrong in the first place.

spark.storage.blockManagerSlaveTimeoutMs is similar. It controls how
frequently the HEARTBEAT messages are sent and how frequently they are
expected to arrive. Under default parameters, the driver checks every 45
seconds whether the block managers (living on the executors) are still
alive, and each block manager sends a HEARTBEAT to the driver every 15
seconds.

If anything, increasing spark.akka.timeout is closest to what you want.
It gives more leeway for the communication between the driver and the
executors, such that if the executors take longer than usual to respond,
the currently running task does not just give up after 100 seconds (the
default).

However, it seems that the root cause of the problem is in your
application's use of memory. Are you caching a lot of RDDs? You can find
out more details about what went wrong exactly by going through the worker
logs on <master_url>:8080. The timeout exception that you ran into is
usually a side-effect of a deeper, underlying exception.
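
If it helps, here is a minimal sketch of checking from the driver what is
currently cached and releasing (or re-persisting) it; sc and factors below
are placeholders for your own SparkContext and RDD handles:

import org.apache.spark.storage.StorageLevel

// List what the driver currently considers cached.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"cached RDD $id at storage level ${rdd.getStorageLevel}")
}

factors.unpersist()                      // release the blocks held by the block managers
factors.persist(StorageLevel.DISK_ONLY)  // or keep it cached, but on disk instead of in memory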



Re: Heartbeat exceeds

Posted by Debasish Das <de...@gmail.com>.
This does not seem to help:

export SPARK_JAVA_OPTS="-Dspark.local.dir=/app/spark/tmp
-Dspark.worker.timeout=600 -Dspark.akka.timeout=200
-Dspark.storage.blockManagerSlaveTimeoutMs=300000"

Getting the message leads to a GC failure, followed by the master
declaring the worker dead!

This is related to GC...persisting the factors to disk at each iteration
would resolve this issue, at the cost of some runtime of course...
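
Just to make that idea concrete, a rough sketch of persisting each
iteration's factors to disk before dropping the previous ones, assuming
you drive the iterations yourself; the RDD and the update step below are
toy placeholders, not the real ALS update, and sc is an existing
SparkContext:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Toy stand-in for the factor RDD and its per-iteration update.
var factors: RDD[(Int, Array[Double])] =
  sc.parallelize(0 until 1000).map(i => (i, Array.fill(10)(0.1)))

for (iter <- 1 to 10) {
  val updated = factors.map { case (id, v) => (id, v.map(_ * 0.99)) }
  updated.persist(StorageLevel.DISK_ONLY)  // keep the new factors on disk, not on the heap
  updated.count()                          // materialize before dropping the old RDD
  factors.unpersist()
  factors = updated
}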

I also have another issue...I run with executor memory set to 24g but I
see 18.4 GB in the executor UI...is that expected?



Re: Heartbeat exceeds

Posted by Debasish Das <de...@gmail.com>.
From the documentation this is what I understood:

1. spark.worker.timeout: Number of seconds after which the standalone
deploy master considers a worker lost if it receives no heartbeats.
default: 60

I increased it to 600.

It was pointed out before that if there is GC overload and the worker
takes time to respond, the master thinks the worker JVM died.

I have seen this issue as well several times.

2. spark.akka.timeout: Communication timeout between Spark nodes, in
seconds.
default: 100

I increased it to 200 as was suggested before, but I don't understand when
the communication timeout is triggered. Some explanation of this setting
would be very helpful.

3. spark.storage.blockManagerSlaveTimeoutMs: I could not find
documentation, but as Patrick said, the 45000 number comes from this.

How is this related to spark.worker.timeout?

I bumped it up to 300s, but the JVM goes into GC only if there is memory
pressure on the JVM, right?...Maybe I need to do a YourKit run to
understand the memory usage in more detail. Any suggestions on how to set
up YourKit for memory analysis?

I set it using the following options in spark-env.sh:

export SPARK_JAVA_OPTS="-Dspark.local.dir=/app/spark/tmp
-Dspark.storage.blockManagerSlaveTimeoutMs=300000
-Dspark.worker.timeout=600 -Dspark.akka.timeout=200"



Is this the correct way to specify
spark.storage.blockManagerSlaveTimeoutMs?



Re: Heartbeat exceeds

Posted by azurecoder <ri...@elastacloud.com>.
Interested in a resolution to this. I'm building a large triangular
matrix, so I'm doing something similar to ALS - lots of work on the worker
nodes, and it keeps timing out.

I've tried a few updates to the akka frame sizes, timeouts, and block
manager settings, but I've been unable to complete the job. I will try the
blockManagerSlaveTimeoutMs property now and let you know the effect. That
property doesn't appear to be documented on the site though.

Cheers!

Richard 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-tp3798p3809.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.