You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Valentina Glagoleva (Jira)" <ji...@apache.org> on 2020/09/19 13:46:00 UTC
[jira] [Commented] (SPARK-16190) Worker registration failed: Duplicate worker ID

    [ https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198736#comment-17198736 ] 

Valentina Glagoleva commented on SPARK-16190:
---------------------------------------------

We faced similar issue and found possible problem. There is a possible race condition described in ticket https://issues.apache.org/jira/browse/SPARK-4592:
 (1) Master A fails and Worker attempts to reconnect to all masters
 (2) Master B takes over and notifies Worker
 (3) Worker responds by registering with Master B
 (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

This race condition was partially fixed and worker tries now to reregister only with previously known master except for the first reregister try (usually first request times out as failover takes more than 10 seconds, but if failover happens fast enough then "duplicate worker" exception may happen).

Logs from the driver:
2020-08-04 06:23:57,976 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
2020-08-04 06:23:57,976 ERROR dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for master to reconnect...
2020-08-04 06:23:57,977 INFO worker-register-master-threadpool-2 org.apache.spark.deploy.worker.Worker: Connecting to master x.x.x.21:7077...
2020-08-04 06:23:57,977 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
2020-08-04 06:23:57,977 ERROR dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for master to reconnect...
...
2020-08-04 06:23:58,069 WARN worker-register-master-threadpool-3 org.apache.spark.deploy.worker.Worker: Failed to connect to master x.x.x.22:7077...
2020-08-04 06:23:58,090 INFO worker-register-master-threadpool-2 org.apache.spark.deploy.worker.Worker: Sending master registration message: x.x.x.21:7077
2020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: Master has changed, new master is at spark://x.x.x.21:70772020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: WorkerWebUI is available at http://x.x.x.21:8080/proxy/worker-20200804061729-x.x.x.47-407092020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: Cancelling registration entries
2020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: Cancelling is done
2020-08-04 06:23:58,102 INFO dispatcher-event-loop-2 org.apache.spark.deploy.worker.Worker: MasterChanged: Sending master spark://x.x.x.21:7077my id worker-20200804061729-x.x.x.47-407092020-08-04 06:23:58,125 ERROR dispatcher-event-loop-1 org.apache.spark.deploy.worker.Worker: Worker registration failed: Duplicate worker ID
2020-08-04 06:23:58,129 INFO shutdown-hook-0 org.apache.spark.deploy.worker.ExecutorRunner: Killing process!
Logs from the master:
2020-08-04 06:23:57,635 INFO dispatcher-event-loop-1-EventThread org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent: We have gained leadership
2020-08-04 06:23:57,787 INFO dispatcher-event-loop-1 org.apache.spark.deploy.master.Master: I have been elected leader! New state: RECOVERING
..
2020-08-04 06:23:57,870 INFO dispatcher-event-loop-1 org.apache.spark.deploy.master.Master: Trying to recover worker from the list of the workers I know: worker-20200804061729-x.x.x.47-40709
2020-08-04 06:23:57,870 INFO dispatcher-event-loop-1 org.apache.spark.deploy.master.Master: registerWorker: Added worker x.x.x.47, worker-20200804061729-x.x.x.47-40709 to list of idToWorker
..
2020-08-04 06:23:58,119 INFO dispatcher-event-loop-0 org.apache.spark.deploy.master.Master: Registering worker x.x.x.47:40709 with 4 cores, 14.7 GB RAM
2020-08-04 06:23:58,120 INFO dispatcher-event-loop-0 org.apache.spark.deploy.master.Master: Worker x.x.x.47:40709 is already in the list of known worker, sending error 'Duplicate worker ID', id worker-20200804061729-x.x.x.47-40709
2020-08-04 06:23:58,130 INFO dispatcher-event-loop-1 org.apache.spark.deploy.master.Master: Worker has been re-registered: worker-20200804061729-x.x.x.47-40709
2020-08-04 06:23:58,141 INFO dispatcher-event-loop-1 org.apache.spark.deploy.master.Master: Recovery complete - resuming operations!

> Worker registration failed: Duplicate worker ID
> -----------------------------------------------
>
>                 Key: SPARK-16190
>                 URL: https://issues.apache.org/jira/browse/SPARK-16190
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Thomas Huang
>            Priority: Minor
>         Attachments: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several worker crashed simultaneously due to this error: 
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: "/data/jdk1.7.0_60/bin/java" "-cp" "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" "--hostname" "100.65.21.223" "--cores" "5" "--app-id" "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor app-20160624153031-0467/27 for SparkRealtimeRecommender
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: "/data/jdk1.7.0_60/bin/java" "-cp" "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms61440M" "-Xmx61440M" "-Dspark.driver.port=50193" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@100.65.21.199:50193" "--executor-id" "27" "--hostname" "100.65.21.223" "--cores" "46" "--app-id" "app-20160624153031-0467" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 ERROR Worker: Worker registration failed: Duplicate worker ID
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org