Posted to issues@spark.apache.org by "niranda perera (JIRA)" <ji...@apache.org> on 2016/04/19 22:55:25 UTC

[jira] [Created] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

niranda perera created SPARK-14736:
--------------------------------------

             Summary: Deadlock in registering applications while the Master is in the RECOVERING mode
                 Key: SPARK-14736
                 URL: https://issues.apache.org/jira/browse/SPARK-14736
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.4.1
         Environment: unix, Spark cluster with a custom StandaloneRecoveryModeFactory and a custom PersistenceEngine
            Reporter: niranda perera
            Priority: Critical


I have encountered the following issue in the standalone recovery mode. 

Suppose an application A is running in the cluster. Due to some issue, the entire cluster, together with application A, goes down.

Later, the cluster comes back online and the master goes into the 'recovering' mode, because the Persistence Engine shows that some apps, workers and drivers were already in the cluster. During the recovery process, the application comes back online, but now with a different ID, say B.

However, under the master's application registration logic, application B will NOT be added to 'waitingApps'; the master only logs the message "Attempted to re-register application at same address". [1]

  private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }


The problem here is that the master is trying to recover application A, which no longer exists. Therefore, after the recovery process, app A is dropped. However, app A's successor, app B, was also omitted from the 'waitingApps' list because it has the same address that app A had previously.

This creates a deadlock in the cluster: neither app A nor app B is available.
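
To make the sequence concrete, here is a minimal, self-contained model of it. This is only a sketch: `App`, `ToyMaster`, `removeUnresponsive` and the address `host:7077` are hypothetical stand-ins, not the real Master classes, and the only behaviour taken from Spark is the address check in the `registerApplication` excerpt above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ApplicationInfo: just an ID and the driver address.
final case class App(id: String, driverAddress: String)

// Hypothetical, heavily simplified model of the Master's bookkeeping.
class ToyMaster {
  val addressToApp = mutable.HashMap.empty[String, App]
  val waitingApps = mutable.ArrayBuffer.empty[App]

  // Same shape as the address check in the excerpt: refuse a second app at a known address.
  def registerApplication(app: App): Boolean = {
    if (addressToApp.contains(app.driverAddress)) {
      false
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
      true
    }
  }

  // Assumed model of the end of recovery: an app that never responded is dropped.
  def removeUnresponsive(app: App): Unit = {
    addressToApp.remove(app.driverAddress)
    waitingApps -= app
  }
}

object DeadlockDemo {
  // Returns (was B accepted?, number of waiting apps after recovery).
  def run(): (Boolean, Int) = {
    val master = new ToyMaster
    val a = App("A", "host:7077")
    master.registerApplication(a) // app A restored from the persistence engine
    val accepted = master.registerApplication(App("B", "host:7077")) // B arrives during recovery
    master.removeUnresponsive(a) // recovery ends, A is gone
    (accepted, master.waitingApps.size)
  }
}
```

In this model, B is rejected while A still occupies the address, and A is then removed, so the waiting list ends up empty.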

When the master is in the RECOVERING mode, shouldn't it add all the registering apps to a list first, and then after the recovery is completed (once the unsuccessful recoveries are removed), deploy the apps which are new?

I think this would resolve the deadlock.
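
A rough sketch of what I have in mind, under stated assumptions: `AppDesc`, `DeferringMaster`, `restore`, `register` and `admit` are hypothetical names for illustration and are not part of the Spark code base. Registrations that arrive while the master is recovering are buffered, and they are replayed only after the unsuccessful recoveries have been removed.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an application: an ID and the driver address.
final case class AppDesc(id: String, driverAddress: String)

// Hypothetical master that defers new registrations until recovery completes.
class DeferringMaster {
  private var recovering = true
  private val addressToApp = mutable.HashMap.empty[String, AppDesc]
  private val pending = mutable.ArrayBuffer.empty[AppDesc]
  val waitingApps = mutable.ArrayBuffer.empty[AppDesc]

  // Apps read back from the persistence engine are admitted directly.
  def restore(app: AppDesc): Unit = admit(app)

  // While recovering, only remember the request; otherwise admit immediately.
  def register(app: AppDesc): Unit =
    if (recovering) pending += app else admit(app)

  // Drop the apps that did not come back, then replay the buffered registrations.
  def completeRecovery(unresponsive: Seq[AppDesc]): Unit = {
    unresponsive.foreach { app =>
      addressToApp.remove(app.driverAddress)
      waitingApps -= app
    }
    recovering = false
    pending.foreach(admit)
    pending.clear()
  }

  // The existing rule is kept: at most one app per driver address.
  private def admit(app: AppDesc): Boolean =
    if (addressToApp.contains(app.driverAddress)) {
      false
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
      true
    }
}
```

With this ordering, app B is checked against the address table only after app A's stale entry has been removed, so it ends up in the waiting list. A buffered app whose address still belongs to a successfully recovered app would still be rejected, as it is today.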

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
