Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/09/14 08:04:20 UTC

[jira] [Resolved] (SPARK-17449) Relation between heartbeatInterval and network timeout

     [ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17449.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 15042
[https://github.com/apache/spark/pull/15042]

> Relation between heartbeatInterval and network timeout
> ------------------------------------------------------
>
>                 Key: SPARK-17449
>                 URL: https://issues.apache.org/jira/browse/SPARK-17449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Yang Liang
>            Priority: Minor
>             Fix For: 2.1.0
>
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by several internal
>  * components to convey liveness or execution information for in-progress tasks. It will also
>  * expire the hosts that have not heartbeated for more than spark.network.timeout.
>  */
> private val executorTimeoutMs =
>     sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000
> The relationship between spark.network.timeout and spark.executor.heartbeatInterval should at least be mentioned in the documentation; otherwise the errors above are confusing. Could the settings also be validated when they are read?
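
The check requested at the end of the report could look roughly like the sketch below. It is only an illustration of the relationship between the two settings, not the change made in the pull request: HeartbeatConfigCheck, toMillis and check are hypothetical names, the parsing uses scala.concurrent.duration rather than Spark's own time-string handling, and every value is assumed to carry an explicit unit suffix such as "s" or "ms".

```scala
import scala.concurrent.duration._

// Hypothetical helper, not part of Spark.
object HeartbeatConfigCheck {
  // Parses strings such as "20s" or "120000ms" into milliseconds.
  // Assumption: the value always has a unit suffix that
  // scala.concurrent.duration.Duration can parse.
  def toMillis(value: String): Long = Duration(value).toMillis

  // Returns a message when the heartbeat interval is not below the timeout,
  // and None when the two settings are consistent with each other.
  def check(heartbeatInterval: String, networkTimeout: String): Option[String] = {
    val intervalMs = toMillis(heartbeatInterval)
    val timeoutMs = toMillis(networkTimeout)
    if (intervalMs >= timeoutMs)
      Some(s"spark.executor.heartbeatInterval ($intervalMs ms) should be less than " +
        s"spark.network.timeout ($timeoutMs ms)")
    else None
  }

  def main(args: Array[String]): Unit = {
    // Settings from the second and third runs in the report: interval above timeout.
    println(check("200s", "10s"))
    // Settings from the first run; 120 s is the timeout shown in its log line.
    println(check("20s", "120s"))
  }
}
```

A check of this kind would flag only the second and third runs. The first run's settings pass it, so the timeout seen there has some other cause that the report does not identify.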



