Posted to reviews@spark.apache.org by "littlelittlewhite09 (via GitHub)" <gi...@apache.org> on 2024/03/22 12:44:13 UTC

[PR] fix driver pod stuck when driver on k8s [spark]

littlelittlewhite09 opened a new pull request, #45667:
URL: https://github.com/apache/spark/pull/45667

   ### What changes were proposed in this pull request?
   This PR is related to [SPARK-47488](https://issues.apache.org/jira/browse/SPARK-47488).
   The idea is that, when Spark runs on K8s and the driver encounters an exception, the driver is terminated once the SparkContext is closed, regardless of whether the non-daemon threads have stopped.
   
   
   
   
   
   ### Why are the changes needed?
   We are migrating Spark apps from YARN cluster mode to K8s. In YARN cluster mode everything works: the driver terminates normally when it encounters an exception or error. On K8s, however, the driver pod may get stuck after an exception even though the SparkContext is closed. We found that this is caused by non-daemon threads that have not stopped. In YARN cluster mode, the driver can still stop even if a non-daemon thread has not stopped.
   This PR should make the migration from YARN cluster mode to K8s smoother.
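
The underlying JVM behavior can be illustrated with a small sketch. It is not code from this PR. The names `NonDaemonThreadDemo` and `workerIsDaemon` are made up for illustration, and the thread pool stands in for whatever non-daemon threads a user application might leave running. A pool built with `Executors.newFixedThreadPool` uses the default thread factory, which creates non-daemon threads, so an idle worker keeps the JVM alive after `main` returns unless the pool is shut down or `System.exit` is called.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, TimeUnit}

object NonDaemonThreadDemo {

  /** Runs a probe task on the pool and reports whether its worker thread is a daemon. */
  def workerIsDaemon(pool: ExecutorService): Boolean =
    pool.submit(new Callable[Boolean] {
      override def call(): Boolean = Thread.currentThread().isDaemon
    }).get()

  def main(args: Array[String]): Unit = {
    // The default thread factory creates non-daemon worker threads.
    val pool = Executors.newFixedThreadPool(1)
    try {
      println(s"worker is daemon: ${workerIsDaemon(pool)}")
      // Simulated application failure (assumption: stands in for a real driver exception).
      throw new IllegalStateException("simulated application failure")
    } catch {
      case e: IllegalStateException =>
        println(s"caught: ${e.getMessage}")
    } finally {
      // If this shutdown is skipped, the idle worker thread stays alive and the
      // JVM does not exit when main returns, even though the failure was handled.
      pool.shutdown()
      pool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

Removing the two lines in the `finally` block should reproduce the hang locally: the process keeps running after `main` returns until it is killed.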
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   Unit tests (UT).
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45667:
URL: https://github.com/apache/spark/pull/45667#issuecomment-2016642717

   If this is YARN's esoteric feature, we had better not do this, @littlelittlewhite09 , because Standalone and K8s are consistent.
   > I am sorry that I am not familiar with spark on standalone, and we do not use standalone as resource manager.




Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45667:
URL: https://github.com/apache/spark/pull/45667#issuecomment-2016640612

   Thank you for sharing, @littlelittlewhite09 .




Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on PR #45667:
URL: https://github.com/apache/spark/pull/45667#issuecomment-2015185406

   @dongjoon-hyun Yes, this PR was re-created. wildfly-openssl is OK. I made an incorrect commit that left the code in a mess, so I decided to re-create the PR.




Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on PR #45667:
URL: https://github.com/apache/spark/pull/45667#issuecomment-2016560862

   > We have unresolved review comments from the previous PR.
   > 
   > 1. [[SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down #45613 (review)](https://github.com/apache/spark/pull/45613#pullrequestreview-1949055273)
   > 2. [[SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down #45613 (comment)](https://github.com/apache/spark/pull/45613#issuecomment-2009763812)
   
   1. I tried to follow this doc to run the Kubernetes integration tests, but the driver pod was never launched. I do not know how to fix it because I am not familiar with K8s. Meanwhile, I tried running the tests through the GitHub workflow, and the K8s tests appear to have passed after my master branch was merged into this PR. Here is the result: https://github.com/littlelittlewhite09/spark/actions/runs/8403080524/job/23013202288
   2. I am sorry that I am not familiar with Spark on `standalone`, and we do not use `standalone` as a resource manager.




Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1536712746


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -996,6 +1005,13 @@ private[spark] class SparkSubmit extends Logging {
           SparkContext.getActive.foreach(_.stop())
         } catch {
           case e: Throwable => logError(s"Failed to close SparkContext: $e")
+        } finally {
+          if (SparkContext.getActive.isEmpty) {
+            if (!DriverPodIsNormal) {
+              logError(s"Driver Pod will exit because: $driverThrow")
+              System.exit(EXIT_FAILURE)

Review Comment:
   What is the YARN exit code in your case? It's not clear from your PR description, @littlelittlewhite09. It should be described in the PR description.
   
   > When spark app runs on yarn-cluster mode, everything is ok. Driver can terminate normally if encounters exception or err. 





Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1536671930


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -983,11 +984,19 @@ private[spark] class SparkSubmit extends Logging {
         e
     }
 
+    var DriverPodIsNormal: Boolean = if (args.master.startsWith("k8s")) true else false
+    var driverThrow: Throwable = null
     try {
       app.start(childArgs.toArray, sparkConf)
     } catch {
       case t: Throwable =>
-        throw findCause(t)
+        logWarning("Some ERR/Exception happened when app is running.")
+        if (args.master.startsWith("k8s")) {

Review Comment:
   Yes, using `DriverPodIsNormal` here may be better.





Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1535795333


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -996,6 +1005,13 @@ private[spark] class SparkSubmit extends Logging {
           SparkContext.getActive.foreach(_.stop())
         } catch {
           case e: Throwable => logError(s"Failed to close SparkContext: $e")
+        } finally {
+          if (SparkContext.getActive.isEmpty) {
+            if (!DriverPodIsNormal) {
+              logError(s"Driver Pod will exit because: $driverThrow")
+              System.exit(EXIT_FAILURE)

Review Comment:
   Could you provide the YARN code link for this case? Specifically,
   1. Does the YARN job fail with `EXIT_FAILURE`?
   2. If this PR is for consistency between YARN and K8s, we should have the same exit code and the same error message.





Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1535796652


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -983,11 +984,19 @@ private[spark] class SparkSubmit extends Logging {
         e
     }
 
+    var DriverPodIsNormal: Boolean = if (args.master.startsWith("k8s")) true else false
+    var driverThrow: Throwable = null
     try {
       app.start(childArgs.toArray, sparkConf)
     } catch {
       case t: Throwable =>
-        throw findCause(t)
+        logWarning("Some ERR/Exception happened when app is running.")
+        if (args.master.startsWith("k8s")) {

Review Comment:
   Did you want to use `DriverPodIsNormal` here?
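
One possible reading of this suggestion is to evaluate the K8s check once and reuse the result, instead of calling `args.master.startsWith("k8s")` again in the `catch` block. The sketch below is self-contained and does not use Spark APIs. `DriverExitSketch`, `runApp`, `isK8s`, `failure`, and `ExitFailure` are hypothetical names, and the parts of the `catch` block that the hunk does not show are assumed rather than known. It returns the exit code to the caller instead of calling `System.exit`, so that it can be tested.

```scala
object DriverExitSketch {
  // Hypothetical stand-in for the EXIT_FAILURE constant referenced in the diff.
  val ExitFailure: Int = 1

  /**
   * Runs `start`, then `stopContext` when the master is K8s.
   * Returns Some(exit code) if a K8s driver failed and should be force-exited,
   * None on success. For other masters the exception is rethrown unchanged.
   */
  def runApp(master: String, start: () => Unit, stopContext: () => Unit): Option[Int] = {
    // Evaluate the master check once and reuse it below.
    val isK8s = master.startsWith("k8s")
    var failure: Option[Throwable] = None
    try {
      start()
    } catch {
      case t: Throwable =>
        // Assumption: on K8s the error is recorded so cleanup can run first.
        if (isK8s) failure = Some(t) else throw t
    } finally {
      if (isK8s) stopContext()
    }
    failure.map(_ => ExitFailure)
  }
}
```

A caller could then force the exit with `DriverExitSketch.runApp(master, start, stop).foreach(code => sys.exit(code))`, which stops the JVM even if non-daemon threads are still alive.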





Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "littlelittlewhite09 (via GitHub)" <gi...@apache.org>.
littlelittlewhite09 commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1536671147


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -996,6 +1005,13 @@ private[spark] class SparkSubmit extends Logging {
           SparkContext.getActive.foreach(_.stop())
         } catch {
           case e: Throwable => logError(s"Failed to close SparkContext: $e")
+        } finally {
+          if (SparkContext.getActive.isEmpty) {
+            if (!DriverPodIsNormal) {
+              logError(s"Driver Pod will exit because: $driverThrow")
+              System.exit(EXIT_FAILURE)

Review Comment:
   Here is the YARN exit code:
   [resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala](https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L870)
   It seems the ApplicationMaster uses its own failure exit code, which is not suitable for the driver pod.





Re: [PR] [SPARK-47488][k8s]fix driver pod stuck when driver on k8s [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45667:
URL: https://github.com/apache/spark/pull/45667#discussion_r1536712423


##########
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##########
@@ -983,11 +984,19 @@ private[spark] class SparkSubmit extends Logging {
         e
     }
 
+    var DriverPodIsNormal: Boolean = if (args.master.startsWith("k8s")) true else false
+    var driverThrow: Throwable = null
     try {
       app.start(childArgs.toArray, sparkConf)
     } catch {
       case t: Throwable =>
-        throw findCause(t)
+        logWarning("Some ERR/Exception happened when app is running.")
+        if (args.master.startsWith("k8s")) {

Review Comment:
   Then, please use it.


