
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/09 15:59:19 UTC

[GitHub] [spark] Ngone51 opened a new pull request #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens

URL: https://github.com/apache/spark/pull/24569

## What changes were proposed in this pull request?

#2828 first introduced the *duplicate Worker register* bug, and #3447 fixed it. However, there are still corner cases (see SPARK-23191 for details) that #3447 does not cover:

* CASE 1
(1) Master A is disconnected (e.g. due to a network drop), but has not failed
(2) Worker attempts to reconnect to all masters
(3) In the meantime, the network recovers and Worker registers with Master A again
(4) Master A responds with `RegisterWorkerFailed("Duplicate worker ID")`
(5) Worker receives that message and exits

* CASE 2
(1) Master A loses leadership and Master B takes over
(2) Worker receives `MasterChanged` and changes masterRef to Master B
(3) Master A receives a late HeartBeat from Worker and asks it to reconnect
(4) Worker receives `ReconnectWorker` from Master A and reconnects to all masters
(5) Master B receives a register request from the Worker again and responds with `RegisterWorkerFailed("Duplicate worker ID")`
(6) Worker receives that message and exits

For CASE 2, we can avoid the duplicate worker registration by comparing the current active master URL with the URL in the reconnect request in step (4).
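
The CASE 2 check can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Worker code: the `ReconnectGuard` object, the `shouldReconnect` helper, and the assumption that `ReconnectWorker` carries the requesting master's URL as a `masterUrl` field are all illustrative.

```scala
// Hypothetical, simplified model; not the actual Spark Worker implementation.
object ReconnectGuard {
  // Stand-in for the reconnect message; the masterUrl field is an assumption.
  final case class ReconnectWorker(masterUrl: String)

  // Returns true only when the reconnect request comes from the master
  // that the worker currently considers active. A request from a stale
  // master (Master A in CASE 2) is ignored instead of triggering a
  // re-registration with all masters.
  def shouldReconnect(activeMasterUrl: String, msg: ReconnectWorker): Boolean =
    msg.masterUrl == activeMasterUrl
}
```

With this guard, a late `ReconnectWorker` from Master A would be dropped once the worker has switched to Master B, so step (5) of CASE 2 would not occur.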

For CASE 1, I found it troublesome to avoid *duplicate Worker register* in this special case. So this PR proposes logging a warning rather than terminating the worker process when a duplicate worker registration happens. This way, the worker and master can continue to work with each other.
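
A minimal sketch of the "warn rather than terminate" behavior, under stated assumptions: the `Registered` and `RegisterFailed` messages, the `duplicate` flag, and the `Outcome` type are hypothetical stand-ins, and the real message definitions in this PR may differ. Log lines are returned as strings so the example stays self-contained.

```scala
// Hypothetical, simplified model of the worker-side handling; not Spark's API.
object DuplicateRegistration {
  sealed trait RegisterResponse
  // Assumed shape: the master flags a duplicate instead of failing the request.
  final case class Registered(masterUrl: String, duplicate: Boolean) extends RegisterResponse
  final case class RegisterFailed(reason: String) extends RegisterResponse

  // Log lines the worker would emit, and whether it keeps running.
  final case class Outcome(logs: List[String], alive: Boolean)

  def handle(resp: RegisterResponse): Outcome = resp match {
    case Registered(url, duplicate) =>
      // A duplicate registration only adds a warning; the worker stays alive.
      val warn =
        if (duplicate) List(s"WARN Duplicate registration at master $url") else Nil
      Outcome(warn :+ s"INFO Successfully registered with master $url", alive = true)
    case RegisterFailed(reason) =>
      // Other registration failures still stop the worker in this sketch.
      Outcome(List(s"ERROR Worker registration failed: $reason"), alive = false)
  }
}
```

In this model a duplicate registration yields a warning followed by the normal success message, which matches the order of the last two lines in the log snippet in the testing section.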

## How was this patch tested?

Tested manually.

I followed the steps Neeraj Gupta suggested in JIRA SPARK-23191 to reproduce CASE 1.

Before this PR, the Worker would be shown as DEAD in the UI.
After this PR, the Worker only warns about the duplicate registration (see the second-to-last row in the log snippet below) and is still shown as ALIVE in the UI.

```
19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: wuyi.local:7077 Disassociated !
19/05/09 20:58:32 INFO Worker: Connecting to master wuyi.local:7077...
19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
19/05/09 20:58:37 WARN TransportClientFactory: DNS resolution for wuyi.local/127.0.0.1:7077 took 5005 ms
19/05/09 20:58:37 INFO TransportClientFactory: Found inactive connection to wuyi.local/127.0.0.1:7077, creating a new one.
19/05/09 20:58:37 INFO TransportClientFactory: Successfully created connection to wuyi.local/127.0.0.1:7077 after 3 ms (0 ms spent in bootstraps)
19/05/09 20:58:37 WARN Worker: Duplicate registration at master spark://wuyi.local:7077
19/05/09 20:58:37 INFO Worker: Successfully registered with master spark://wuyi.local:7077
```
