Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:25:18 UTC

[jira] [Updated] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

     [ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-14736:
---------------------------------
    Labels: bulk-closed  (was: )

> Deadlock in registering applications while the Master is in the RECOVERING mode
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-14736
>                 URL: https://issues.apache.org/jira/browse/SPARK-14736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.5.0, 1.6.0
>         Environment: unix, Spark cluster with a custom StandaloneRecoveryModeFactory and a custom PersistenceEngine
>            Reporter: niranda perera
>            Priority: Critical
>              Labels: bulk-closed
>
> I have encountered the following issue in the standalone recovery mode. 
> Suppose application A is running in the cluster. Due to some issue, the entire cluster, together with application A, goes down.
> Later the cluster comes back online and the master goes into the 'recovering' mode, because the Persistence Engine reports apps, workers and drivers that were previously in the cluster. While the recovery is in progress, the application comes back online, but now it has a different ID, let's say B.
> But as per the master's application registration logic, application B will NOT be added to 'waitingApps'; the master only logs the message "Attempted to re-register application at same address". [1]
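>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     // An app already recorded at this driver address blocks the new registration.
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (rest of the method omitted in this excerpt)
>   }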
> The problem here is that the master is trying to recover application A, which no longer exists. Therefore, after the recovery process, app A will be dropped. However, app A's successor, app B, was also omitted from the 'waitingApps' list because it has the same address that app A had previously.
> This creates a deadlock in the cluster: neither app A nor app B is available in the cluster.
> When the master is in the RECOVERING mode, shouldn't it add all the registering apps to a list first, and then, after the recovery is completed (once the unsuccessful recoveries are removed), deploy the apps which are new?
> This would resolve the deadlock, IMO.
> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
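
The proposal in the description (hold registrations that arrive during recovery, then process them once the failed recoveries have been removed) could look roughly like the sketch below. This is a simplified model and not Spark's Master code: MasterSketch, AppInfo, pendingRegistrations, beginRecovery and completeRecovery(respondedAppIds) are hypothetical names chosen for illustration, and workers, drivers and the persistence engine are left out.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's ApplicationInfo; only the fields the sketch needs.
final case class AppInfo(id: String, driverAddress: String)

// Simplified, hypothetical model of the master's registration bookkeeping.
class MasterSketch {
  private var recovering = false
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]
  // Registrations received while recovering; an assumption of this sketch, not an existing field.
  private val pendingRegistrations = mutable.ArrayBuffer.empty[AppInfo]

  // Re-create the apps read back from persistent storage and enter recovery.
  def beginRecovery(storedApps: Seq[AppInfo]): Unit = {
    recovering = true
    storedApps.foreach { app =>
      addressToApp.update(app.driverAddress, app)
      waitingApps += app
    }
  }

  def registerApplication(app: AppInfo): Unit = {
    if (recovering) {
      // Defer the decision: the app at this address may turn out to be stale.
      pendingRegistrations += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println("Attempted to re-register application at same address: " + app.driverAddress)
    } else {
      addressToApp.update(app.driverAddress, app)
      waitingApps += app
    }
  }

  // respondedAppIds: ids of recovered apps whose drivers answered during recovery.
  def completeRecovery(respondedAppIds: Set[String]): Unit = {
    // Drop recovered apps that never responded (app A in the report).
    val stale = addressToApp.values.filterNot(a => respondedAppIds.contains(a.id)).toList
    stale.foreach { a =>
      addressToApp.remove(a.driverAddress)
      waitingApps -= a
    }
    recovering = false
    // Replay the deferred registrations now that stale entries are gone.
    val deferred = pendingRegistrations.toList
    pendingRegistrations.clear()
    deferred.foreach(registerApplication)
  }
}
```

In this model, if app A is restored from storage and app B registers from the same driver address while the master is recovering, B is buffered instead of rejected. Once completeRecovery runs with no response from A, A is dropped and B ends up in waitingApps. A deferred app whose address still belongs to a live recovered app is rejected as before.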



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
