Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/11/25 07:15:12 UTC

[jira] [Commented] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover

    [ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224108#comment-14224108 ] 

Josh Rosen commented on SPARK-4592:
-----------------------------------

It looks like this is a bug that was introduced in [the patch|https://github.com/apache/spark/pull/2828] for SPARK-3736.  Prior to that patch, a worker that became disassociated from a master would wait for a master to initiate a reconnection.  This behavior caused problems when a master failed by stopping and restarting, since the restarted master would never know to initiate a reconnection.  That patch addressed this issue by having workers attempt to reconnect to the master if they think they've become disconnected.  When reviewing that patch, I wrote a [long comment|https://github.com/apache/spark/pull/2828#issuecomment-59602394] that explains it in much more detail; that's a good reference for understanding its motivation.

This change introduced a problem, though: there's now a race condition during multi-master failover.  In the old multi-master code, a worker never initiates a reconnection attempt to a master; instead, it re-registers only after the new active / primary master tells the worker that it's the new master.  With the addition of worker-initiated reconnect, there's a race where the worker detects that it's become disconnected, goes down the list of known masters and tries to connect to each of them, successfully registers with the new primary master, then receives a {{MasterChanged}} message and attempts to register with the new primary master _even though it's already registered_, causing a duplicate worker registration.
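
To make the race concrete, here's a minimal, hypothetical sketch of the two registration paths.  The class and method names below are illustrative, not the actual Worker code; only {{RegisterWorker}} and {{MasterChanged}} are modeled on Spark's DeployMessages:

{code}
object DuplicateRegistrationSketch {
  // Message shapes modeled loosely on Spark's DeployMessages; simplified here.
  case class RegisterWorker(workerId: String, masterUrl: String)
  case class MasterChanged(newMasterUrl: String)

  class Worker(workerId: String, masterUrls: Seq[String]) {
    // Path 1: worker-initiated reconnect (added by the SPARK-3736 patch),
    // fired by a retry timer when the worker believes it has lost its master.
    def reregisterWithMaster(): Unit =
      masterUrls.foreach(url => send(RegisterWorker(workerId, url)))

    // Path 2: failover-initiated reconnect. The newly elected master tells the
    // worker it has changed, and the worker registers with it (again).
    def onMasterChanged(msg: MasterChanged): Unit =
      send(RegisterWorker(workerId, msg.newMasterUrl))

    private def send(msg: RegisterWorker): Unit =
      println(s"sending $msg")  // stand-in for the actual RPC
  }

  def main(args: Array[String]): Unit = {
    val worker = new Worker(
      "worker-20141123022302-172.17.0.90-8888",
      Seq("spark://172.17.0.86:7077", "spark://172.17.0.87:7077", "spark://172.17.0.88:7077"))

    // The race: the retry timer fires and successfully registers with the new
    // primary (.87); then the MasterChanged message arrives and the worker
    // registers with .87 a second time -> "Duplicate worker ID" on the master.
    worker.reregisterWithMaster()
    worker.onMasterChanged(MasterChanged("spark://172.17.0.87:7077"))
  }
}
{code}

Both paths end up sending a registration for the same worker ID to the same master, which is exactly the "Duplicate worker ID" rejection in the log below.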

There are a number of ways that we might fix this, but we have to be careful because it seems likely that the worker-initiated reconnect could have introduced other problems:

- What happens if a worker sends a reconnection attempt to a live master which is not the new primary?  Will that non-primary master reject or redirect those registrations, or will it register the workers and cause a split-brain scenario to occur?
- The Worker is implemented as an actor, so it relies on message-at-a-time processing rather than explicit synchronization of its internal state.  The asynchronous re-registration timer may violate that assumption, because the timer thread directly calls internal Worker methods instead of sending messages to the Worker's own mailbox (see the sketch after this list).
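
To illustrate the second concern, here's a hypothetical Akka-style sketch (the names {{WorkerLikeActor}}, {{ReregisterWithMaster}}, and {{reregisterWithMaster()}} are illustrative, not the real Worker code).  Scheduling a closure runs it on a scheduler thread, which touches actor state concurrently with {{receive}}; scheduling a message to {{self}} keeps the one-message-at-a-time guarantee:

{code}
import akka.actor.Actor
import scala.concurrent.duration._

case object ReregisterWithMaster

class WorkerLikeActor extends Actor {
  import context.dispatcher
  private var connected = false  // actor-internal state, not thread-safe

  // Unsafe (resembles the hazard described above): the closure runs on a
  // scheduler thread and mutates `connected` concurrently with receive().
  context.system.scheduler.scheduleOnce(10.seconds)(reregisterWithMaster())

  // Safe alternative: deliver the retry through the mailbox, so it is
  // processed one message at a time like everything else.
  context.system.scheduler.scheduleOnce(10.seconds, self, ReregisterWithMaster)

  private def reregisterWithMaster(): Unit = {
    if (!connected) {
      // ... try each known master URL, then update internal state:
      connected = true
    }
  }

  def receive: Receive = {
    case ReregisterWithMaster => reregisterWithMaster()
  }
}
{code}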

One simple fix might be to have the worker never initiate reconnection attempts when running in a multi-master environment.  I still need to think through whether this will cause new problems similar to SPARK-3736.  I don't think it will be a problem because that patch was motivated by cases where the master forgot who the worker was and couldn't initiate a reconnect.  If the list of registered workers is stored durably in ZooKeeper such that a worker is never told that it has registered until a record of its registration has become durable, then I think this is fine: if a live master thinks that a worker has disconnected, then it will initiate a reconnection; when a new master fails over, it will reconnect all workers based on the list from ZooKeeper. 
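
A minimal sketch of what that guard might look like, assuming hypothetical names ({{masterUrls}}, {{tryRegisterAllMasters()}}) rather than the real Worker fields:

{code}
// Sketch only: skip worker-initiated reconnect when multiple masters are
// configured, and rely on the recovered master (driven by its ZooKeeper
// record of registered workers) to send MasterChanged instead.
class ReconnectPolicySketch(masterUrls: Seq[String]) {
  private val multiMasterMode = masterUrls.size > 1

  /** Called by the retry timer when the worker thinks it lost its master. */
  def maybeReregister(): Unit = {
    if (multiMasterMode) {
      // In HA mode, initiating our own attempt here is what races with the
      // MasterChanged message and produces the duplicate worker ID.
      return
    }
    tryRegisterAllMasters()
  }

  private def tryRegisterAllMasters(): Unit =
    masterUrls.foreach(url => println(s"Connecting to master $url..."))
}
{code}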

> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-4592
>                 URL: https://issues.apache.org/jira/browse/SPARK-4592
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>
> When running Spark Standalone in high-availability mode, we sometimes see "Worker registration failed: Duplicate worker ID" errors that prevent workers from reconnecting to the new active master.  I've attached full logs from a reproduction in my integration test suite (which runs something similar to Spark's FaultToleranceTest).  Here's the relevant excerpt from a worker log during a failed run of the "rolling outage" test, which creates a multi-master cluster, then repeatedly kills the active master, waits for workers to reconnect to a new active master, kills that master, and so on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-0000" "akka.tcp://sparkWorker@172.17.0.90:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}


