Posted to reviews@spark.apache.org by "littlelittlewhite09 (via GitHub)" <gi...@apache.org> on 2024/03/20 12:46:15 UTC

[PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

littlelittlewhite09 opened a new pull request, #45613:
URL: https://github.com/apache/spark/pull/45613

   ### What changes were proposed in this pull request?
   The issue is tracked in JIRA: [SPARK-47488](https://issues.apache.org/jira/browse/SPARK-47488)
   
   ### Why are the changes needed?
   In k8s client mode, if the application hits an exception while a user thread pool has not been shut down, the driver pod may get stuck and never terminate. This PR fixes that issue; a minimal sketch of the failure mode follows.
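   A minimal sketch of the failure mode (illustrative code, not the exact reproducer attached to the JIRA): the app leaves a non-daemon thread pool running, and its idle worker threads keep the driver JVM, and hence the driver pod, alive after `main()` returns.
   
   ```scala
   import java.util.concurrent.Executors
   
   import org.apache.spark.sql.SparkSession
   
   object StuckDriverDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("stuck-driver-demo").getOrCreate()
       // Executors.newFixedThreadPool creates non-daemon threads by default.
       val pool = Executors.newFixedThreadPool(4)
       pool.submit(new Runnable {
         override def run(): Unit = throw new RuntimeException("task failed")
       })
       spark.stop()
       // Bug: pool.shutdown() is never called, so the JVM cannot exit
       // and the driver pod never terminates in k8s client mode.
     }
   }
   ```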
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   UT
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2012553696

   @dongjoon-hyun @yaooqinn Thanks for your review. I am sorry for making so many fatal mistakes; I have now re-committed the code. Let me rephrase this issue. When a thread pool is not shut down, a driver running in yarn-cluster mode can still exit normally after an exception. But in yarn-client mode or on k8s, the driver gets stuck when an exception occurs. In practice, we hit this issue while migrating Spark apps from yarn-cluster to k8s.
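   One plausible reason the modes differ, shown as a hedged illustration of plain JVM semantics (simplified demo, not Spark's actual source): an explicit `System.exit(...)` terminates the JVM regardless of live non-daemon threads, while a plain return from `main()` waits for every non-daemon thread to finish.
   
   ```scala
   object ExitBehaviorDemo {
     def main(args: Array[String]): Unit = {
       // A non-daemon thread that never finishes on its own.
       val t = new Thread(() => Thread.sleep(Long.MaxValue))
       t.start()
       if (args.headOption.contains("exit")) {
         System.exit(0) // JVM dies immediately, despite the live thread
       }
       // Otherwise main() returns but the JVM keeps running: the hang
       // observed in yarn-client and k8s modes.
     }
   }
   ```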
   




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532199096


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -984,20 +984,39 @@ private[spark] class SparkSubmit extends Logging {
         e
     }
 
+    //    this variable is used to judge whether driver pod is normal if spark is on k8s
+    var DriverPodIsNormal: Boolean = if (args.master.startsWith("k8s")) true else false
+    var driverThrow: Throwable = null
     try {
       app.start(childArgs.toArray, sparkConf)
     } catch {
       case t: Throwable =>
-        throw findCause(t)
+        logWarning("Some ERR/Exception happened when app is running.")
+        if (args.master.startsWith("k8s")) {
+          DriverPodIsNormal = false
+          driverThrow = t
+        } else {
+          throw findCause(t)
+        }
     } finally {
       if (args.master.startsWith("k8s") && !isShell(args.primaryResource) &&
           !isSqlShell(args.mainClass) && !isThriftServer(args.mainClass) &&
           !isConnectServer(args.mainClass)) {
         try {
+          logWarning("Begin to close SparkContext inside driver pod......")
           SparkContext.getActive.foreach(_.stop())
         } catch {
           case e: Throwable => logError(s"Failed to close SparkContext: $e")
-        }
+        } finally {
+          if (SparkContext.getActive.isEmpty) {
+            logWarning("Finished to close SparkContext inside driver pod successfully.")
+            if (!DriverPodIsNormal) {
+              logError(s"Driver Pod will exit because: $driverThrow")
+              System.exit(1)
+            }
+          } else {
+            logWarning("Failed to close SparkContext.")
+          }

Review Comment:
   Is the indentation correct? Lines 1019 and 1020 seem to have a gap.





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2012650986

   > I checked your example code in the JIRA and it looks weird. Ideally, you should stop the non-daemon threads for a JVM to exit, and stop SparkContext to terminate the app
   
   When the thread pool is not shut down, the driver can still exit if the app runs on yarn-cluster, but it cannot on k8s. I'm not sure whether this difference in behavior is reasonable.
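   A user-side workaround worth noting (a sketch, not part of this PR): build the pool from daemon threads so a leaked pool cannot pin the driver JVM.
   
   ```scala
   import java.util.concurrent.{Executors, ThreadFactory}
   
   // Daemon threads do not prevent JVM exit, so an un-shut-down pool
   // no longer hangs the driver on k8s.
   val daemonFactory = new ThreadFactory {
     override def newThread(r: Runnable): Thread = {
       val t = Executors.defaultThreadFactory().newThread(r)
       t.setDaemon(true)
       t
     }
   }
   val pool = Executors.newFixedThreadPool(4, daemonFactory)
   ```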




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2009763812

   BTW, Apache Spark 4.0.0 supports three resource managers: `Spark Standalone`, `YARN`, and `K8S`. Do you happen to know the behavior of the `Spark Standalone` resource manager?




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 closed pull request #45613: [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down
URL: https://github.com/apache/spark/pull/45613




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532200417


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ ... @@ (hunk identical to the first review comment above; trimmed to the commented lines)
+          } else {
+            logWarning("Failed to close SparkContext.")

Review Comment:
   May I ask when this actually happens? And what was the previous behavior?





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2012636031

   > I checked your example code in the JIRA and it looks weird. Ideally, you should stop the non-daemon threads for a JVM to exit, and stop SparkContext to terminate the app
   
   If the non-daemon threads are not stopped on yarn-cluster, the driver can still exit. But this does not work on k8s.
   




Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532183713


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ ... @@ (hunk identical to the first review comment above; trimmed to the commented lines)
+            if (!DriverPodIsNormal) {
+              logError(s"Driver Pod will exit because: $driverThrow")
+              System.exit(1)
Review Comment:
   Please use the official `SparkExitCode`.
   
   https://github.com/apache/spark/blob/a3c04ec1145662e4227d57cd953bffce96b8aad7/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala#L23
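   To make the suggestion concrete, a hedged sketch of the replacement (assuming `SparkExitCode.EXIT_FAILURE` is the constant with value 1 in the file linked above):
   
   ```scala
   import org.apache.spark.util.SparkExitCode
   
   // Named constant instead of the magic number 1:
   logError(s"Driver Pod will exit because: $driverThrow")
   System.exit(SparkExitCode.EXIT_FAILURE)
   ```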





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532193579


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ ... @@ (hunk identical to the first review comment above; trimmed to the commented lines)
+        } finally {
+          if (SparkContext.getActive.isEmpty) {
+            logWarning("Finished to close SparkContext inside driver pod successfully.")

Review Comment:
   This extra warning message is a regression for healthy Spark K8s jobs.





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532189265


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ ... @@ (hunk identical to the first review comment above; trimmed to the commented lines)
         try {
+          logWarning("Begin to close SparkContext inside driver pod......")

Review Comment:
   This extra warning message is a regression for healthy Spark K8s jobs.





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45613:
URL: https://github.com/apache/spark/pull/45613#discussion_r1532191319


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ ... @@ (hunk identical to the first review comment above; trimmed to the commented line)
+    //    this variable is used to judge whether driver pod is normal if spark is on k8s

Review Comment:
   Extra spaces, `//      this`?





Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2011072653

   I checked your example code in the JIRA and it looks weird. Ideally, you should stop the non-daemon threads for a JVM to exit, and stop SparkContext to terminate the app
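   A rough sketch of that advice (illustrative names, not this PR's code): shut the pool down so no non-daemon thread outlives `main()`, then stop the SparkContext so the application terminates.
   
   ```scala
   import java.util.concurrent.{Executors, TimeUnit}
   
   import org.apache.spark.sql.SparkSession
   
   object CleanShutdownDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("clean-shutdown-demo").getOrCreate()
       val pool = Executors.newFixedThreadPool(4)
       try {
         pool.submit(new Runnable {
           override def run(): Unit = println("doing work")
         })
       } finally {
         pool.shutdown()                             // stop accepting new tasks
         pool.awaitTermination(30, TimeUnit.SECONDS) // wait for in-flight tasks
         spark.stop()                                // terminate the application
       }
     }
   }
   ```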

