Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/08/11 16:58:20 UTC

[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message

     [ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17022:
------------------------------------

    Assignee:     (was: Apache Spark)

> Potential deadlock in driver handling message
> ---------------------------------------------
>
>                 Key: SPARK-17022
>                 URL: https://issues.apache.org/jira/browse/SPARK-17022
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>            Reporter: Tao Wang
>            Priority: Critical
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of three functions: CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors or CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint`, and the endpoint receives it.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles the RemoveExecutor message first and the RequestExecutors message second.
> When handling RemoveExecutor, it sends the same message to `driverEndpoint` and waits for the reply.
> To handle that message, `driverEndpoint` must acquire the `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> This causes a deadlock.
> We have seen this issue in our deployment: the driver is blocked and handles no messages until both messages have timed out.
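
The sequence described above can be reproduced in a small, self-contained model. This is only a sketch written in plain Python threads, not Spark code: `backend_lock` stands in for the `CoarseGrainedSchedulerBackend` lock, `yarn_inbox` for the single-threaded inbox of `yarnSchedulerEndpoint`, and both timeout values are hypothetical and chosen so the script finishes quickly. The caller's wait is given the shorter timeout on the assumption that its ask started first (at t1) and therefore expires first.

```python
import queue
import threading

CALLER_ASK_TIMEOUT = 0.3  # hypothetical; the caller's ask started at t1, so it expires first
DRIVER_ASK_TIMEOUT = 5.0  # hypothetical upper bound for the endpoint's ask to the driver

backend_lock = threading.Lock()  # stands in for the CoarseGrainedSchedulerBackend lock
yarn_inbox = queue.Queue()       # the endpoint handles one message at a time, in arrival order
events = []                      # ordered record of what happened


def driver_handle_remove_executor():
    # The driver endpoint needs the backend lock before it can process the message.
    acquired = backend_lock.acquire(timeout=DRIVER_ASK_TIMEOUT)
    if acquired:
        backend_lock.release()
    return acquired


def yarn_endpoint_loop():
    while True:
        name, reply = yarn_inbox.get()
        if name == "Stop":
            return
        if name == "RemoveExecutor":
            # Forward to the driver and wait; the inbox is stalled in the meantime.
            ok = driver_handle_remove_executor()
            events.append("RemoveExecutor " + ("forwarded" if ok else "timed out"))
        elif name == "RequestExecutors":
            events.append("RequestExecutors handled")
            reply.set()


def do_request_total_executors():
    with backend_lock:  # t1: the caller holds the backend lock
        # Messages are enqueued in the order the endpoint receives them (t2, then t3).
        yarn_inbox.put(("RemoveExecutor", None))
        reply = threading.Event()
        yarn_inbox.put(("RequestExecutors", reply))
        if not reply.wait(timeout=CALLER_ASK_TIMEOUT):
            events.append("caller timed out")
    # The lock is released only after the caller has given up waiting.


def run_scenario():
    events.clear()
    endpoint = threading.Thread(target=yarn_endpoint_loop)
    endpoint.start()
    do_request_total_executors()
    yarn_inbox.put(("Stop", None))
    endpoint.join()
    return list(events)


if __name__ == "__main__":
    for line in run_scenario():
        print(line)
```

In this model the caller always gives up first, because the reply it waits for is queued behind a message that needs the lock the caller itself holds. Only after the caller releases the lock does the endpoint get past RemoveExecutor and reach RequestExecutors, which mirrors the stall reported in the description.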


