Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/15 08:44:44 UTC

[GitHub] [spark] Ngone51 commented on a change in pull request #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens

Ngone51 commented on a change in pull request #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#discussion_r284147066
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
 ##########
 @@ -485,8 +493,13 @@ private[deploy] class Worker(
       masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList, drivers.keys.toSeq))
 
     case ReconnectWorker(masterUrl) =>
-      logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
-      registerWithMaster()
+      if (masterUrl != activeMasterUrl) {
 
 Review comment:
   > 2. `ReconnectWorker` may be sent by a standby master, as you explained in the PR description.
   
   I got the step order wrong in CASE 2 of the PR description (now revised). Sorry about that. Actually, at the time it sends `ReconnectWorker`, Master A is still active but is about to die (the race condition mentioned above).
   
   There's no doubt that the msg `ReconnectWorker(master)` must be sent by a Master that is active at the time of sending.
   So, when the Worker receives that msg from Master X, the cases are:
   
   1) Master X is active
       1.1) Master X is the initial active master (no `MasterChanged` msg)
           1.1.1) master == activeMasterUrl
               just reconnect to (all) masters
           1.1.2) master != activeMasterUrl
               impossible case
       1.2) Master X has been elected as the new active master
           1.2.1) master == activeMasterUrl (`MasterChanged` comes before `ReconnectWorker`)
               just reconnect to (all) masters
           1.2.2) master != activeMasterUrl (`MasterChanged` comes after `ReconnectWorker`)
               seems very unlikely, but can be a valid case, as you mentioned above. In this case,
               we'll always ignore the reconnect msg until we receive `MasterChanged`.
   2) Master X is inactive; Master Y takes over after Master X sends `ReconnectWorker`
       2.1) master == activeMasterUrl (`MasterChanged` from Y comes after `ReconnectWorker` from X)
           the active master has changed, but the Worker hasn't realized it yet. It will still try to
           reconnect to (all) masters. In this case (contrary to CASE 2), we'll hit the duplicate register issue.
       2.2) master != activeMasterUrl (`MasterChanged` from Y comes before `ReconnectWorker` from X)
           ignore it, since the Worker has already changed the active master to Master Y.
   
   
   **Since this PR changes the result of a duplicate worker registration from exit to warn, I think it's OK to remove this condition check here, because the worst outcome of accepting `ReconnectWorker` is a duplicate registration with the active master, which is covered by this PR's fix.**
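
To make the case analysis above easier to check, here is a rough toy model of the message orderings. It is only a sketch, not the actual `Worker` code: `ToyWorker`, `Outcome`, `run` and the `guardEnabled` flag are hypothetical names, the message classes are simplified stand-ins, and it assumes the `masterUrl != activeMasterUrl` branch ignores the request.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Worker/Master classes.
object ReconnectCases {
  sealed trait Msg
  final case class MasterChanged(masterUrl: String) extends Msg
  final case class ReconnectWorker(masterUrl: String) extends Msg

  sealed trait Outcome
  case object Reconnected extends Outcome // worker re-registers with (all) masters
  case object Ignored extends Outcome     // reconnect request dropped by the check

  // Minimal worker state: only the URL of the master it believes is active.
  final class ToyWorker(initialActive: String, guardEnabled: Boolean) {
    private var activeMasterUrl: String = initialActive

    // Returns Some(outcome) for ReconnectWorker, None for MasterChanged.
    def receive(msg: Msg): Option[Outcome] = msg match {
      case MasterChanged(url) =>
        activeMasterUrl = url
        None
      case ReconnectWorker(url) =>
        // Assumption: with the check in place, a mismatching URL is ignored.
        if (guardEnabled && url != activeMasterUrl) Some(Ignored)
        else Some(Reconnected)
    }
  }

  // Feed messages in arrival order; collect outcomes of the ReconnectWorker msgs.
  def run(initialActive: String, guardEnabled: Boolean, msgs: Seq[Msg]): Seq[Outcome] = {
    val worker = new ToyWorker(initialActive, guardEnabled)
    msgs.flatMap(m => worker.receive(m).toList)
  }
}
```

For example, case 2.1 is `run("X", true, Seq(ReconnectWorker("X"), MasterChanged("Y")))`, which reconnects even with the check enabled, so the check alone cannot rule out a duplicate registration. With `guardEnabled = false`, every ordering reconnects, and the duplicate registration is then left to the warn-instead-of-exit handling.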
   
   
   
