You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 00:44:41 UTC

[GitHub] [spark] kevin85421 opened a new pull request, #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor

kevin85421 opened a new pull request, #37411:
URL: https://github.com/apache/spark/pull/37411

### What changes were proposed in this pull request?
This PR aims to provide a method to lower the timeout. Our solution is to ask master for worker’s heartbeat when Driver does not receive heartbeat from executor for `TimeoutThreshold` seconds.

![Screen Shot 2022-08-01 at 6 10 56 PM](https://user-images.githubusercontent.com/20109646/182272331-2f972aa8-31c9-4c3e-8c88-b7cf2616fad8.png)

In Databricks, driver and master processes are running on the master node. Executor and worker processes are running on the worker node but different JVMs. Therefore, the network connection between driver/executor and master/worker is equivalent because they are running on same physical nodes.

When Executor performs full GC, it cannot send any message during full GC. Next, Driver cannot receive heartbeat from Executor. Instead of removing the executor directly, driver will ask master for `workerLastHeartbeat`. Driver will determine whether it is network disconnection or other issues (e.g. GC) based on `workerLastHeartbeat`. If it is network disconnection, we will remove the executor. Otherwise, we will put the executor into a waitingList rather than expiring it immediately.

* Result
![Screen Shot 2022-08-01 at 6 54 51 PM](https://user-images.githubusercontent.com/20109646/182275086-474d7458-f2a8-473c-adb2-c13f5c942ea1.png)

### Why are the changes needed?
Currently, the driver’s HeartbeatReceiver will expire an executor if it does not receive any heartbeat from the executor for 120 seconds. However, 120 seconds is too long, but we will face other challenges when we try to lower the timeout threshold. To elaborate, when an executor is performing GC, it cannot reply any message.

![Screen Shot 2022-08-01 at 6 03 50 PM](https://user-images.githubusercontent.com/20109646/182269820-4802877d-a4e4-4d20-969d-6ece37ffdb55.png)

We will use the above figure to explain why we cannot lower `TimeoutThreshold` to 60 seconds directly. When Executor performs full GC, it cannot send any message, including heartbeat. Next, driver will remove the executor because driver cannot receive heartbeat from Executor for 60 seconds. In other words, we cannot distinguish between GC and network disconnection.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
build/sbt "core/testOnly *HeartbeatReceiverSuite"
build/sbt "core/testOnly *MasterSuite"
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org