Posted to commits@spark.apache.org by wu...@apache.org on 2022/01/05 02:52:27 UTC

[spark] branch branch-3.0 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new be441e8  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
be441e8 is described below

commit be441e84069acc711ea848c69ae5bd55a7c93531
Author: yi.wu <yi...@databricks.com>
AuthorDate: Wed Jan 5 10:48:16 2022 +0800

    [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.
    
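    The exit pattern itself is small; a minimal sketch, mirroring the actual change in the diff below:
    
    ```scala
    // Run System.exit on a dedicated thread so the calling (RpcEnv) thread
    // is not blocked inside System.exit while shutdown hooks run.
    new Thread("WorkerWatcher-exit-executor") {
      override def run(): Unit = System.exit(-1)
    }.start()
    ```
    
    Starting a named thread, rather than exiting inline, keeps the `RpcEnv` thread free, so a shutdown hook that waits on the `RpcEnv` can still make progress.
    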
    ### Why are the changes needed?
    
    `System.exit` triggers the shutdown hook that runs `executor.stop`, which leads to the same deadlock issue as SPARK-14180. Note, however, that since Spark recently upgraded to Hadoop 3, each hook now has a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) that forcibly interrupts the hook execution once it reaches the timeout. [...]
    
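    To make the failure mode concrete, here is a minimal, self-contained sketch of the deadlock (the names `rpcPool` and `ExitDeadlockSketch` are hypothetical stand-ins, not Spark code; in Spark the hook runs `executor.stop`, which waits on the `RpcEnv`):
    
    ```scala
    import java.util.concurrent.{Executors, TimeUnit}
    
    object ExitDeadlockSketch {
      def main(args: Array[String]): Unit = {
        // Stand-in for an RpcEnv dispatcher thread pool.
        val rpcPool = Executors.newSingleThreadExecutor()
    
        // Stand-in for the executor.stop shutdown hook: it waits for the
        // RPC pool to drain before returning.
        sys.addShutdownHook {
          rpcPool.shutdown()
          rpcPool.awaitTermination(1, TimeUnit.DAYS)
        }
    
        // Calling System.exit from the pool's only thread deadlocks:
        // System.exit blocks until all shutdown hooks finish, while the
        // hook blocks waiting for this very thread to finish.
        rpcPool.submit(new Runnable {
          override def run(): Unit = System.exit(-1)
        })
      }
    }
    ```
    
    Moving the `System.exit` call onto a fresh thread breaks the cycle: the pool thread returns, the hook's wait completes, and the JVM can exit.
    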
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Tested manually.
    
    Closes #35069 from Ngone51/fix-workerwatcher-exit.
    
    Authored-by: yi.wu <yi...@databricks.com>
    Signed-off-by: yi.wu <yi...@databricks.com>
    (cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366)
    Signed-off-by: yi.wu <yi...@databricks.com>
---
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
index efffc9f..b7a5728 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
     if (isTesting) {
       isShutDown = true
     } else if (isChildProcessStopping.compareAndSet(false, true)) {
-      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
-      System.exit(-1)
+      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
+      // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
+      // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
+      new Thread("WorkerWatcher-exit-executor") {
+        override def run(): Unit = System.exit(-1)
+      }.start()
     }
 
   override def receive: PartialFunction[Any, Unit] = {
