You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/05 07:44:45 UTC

[GitHub] [spark] Dam1029 opened a new pull request, #38518: Reset the executor pods watcher when we receive a version changed fro…

Dam1029 opened a new pull request, #38518:
URL: https://github.com/apache/spark/pull/38518

   ### What changes were proposed in this pull request?
   This is a straight application of #33349 onto master, reset executor pods watcher when we encounter too old resource version (https://issues.apache.org/jira/browse/SPARK-33349)
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Running spark-submit to a k8s cluster
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1143681244


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##########
@@ -45,7 +46,9 @@ class ExecutorPodsWatchSnapshotSource(
     conf: SparkConf) extends Logging {
 
   private var watchConnection: Closeable = _

Review Comment:
   The `Watcher` is running on a separate thread, right? And `watchConnection` sometimes accessed from one thread via the `ExecutorPodsWatchSnapshotSource#stop()` and sometimes from the watcher running thread via the `ExecutorPodsWatchSnapshotSource#reset()`. So those methods where `watchConnection` is accessed start(), stop() and reset() must be synchronized.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1613947410

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] simonvanderveldt commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "simonvanderveldt (via GitHub)" <gi...@apache.org>.
simonvanderveldt commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1615031055

   Please stop using these useless stale bots. Issues and PRs should be closed when they are fixed/merged, the passing of time doesn't magically fix anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Dam1029 commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "Dam1029 (via GitHub)" <gi...@apache.org>.
Dam1029 commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1426635994

   > To @Dam1029 ,
   > 
   > * Could you anwer @holdenk 's comment (https://github.com/apache/spark/pull/38518/files#r1036510431)?
   > * To be safe, could you revisie this PR by adding a new internal configuration like `KUBERNETES_EXECUTOR_ENABLE_API_WATCHER` and use it with the following?
   > 
   > ```scala
   > if (e != null && e.isHttpGone) {
   > ```
   
   1. Yes, I have answered the comment. When ExecutorPodsWatcher onClose function meet HttpGone, it mean the executor watcher clinent is too old resource version in k8s, and should be reset the connection.  It will not appear infinite loop, and I think it is no need to set max resets in K minutes counter.
   2. I have add a new internal configuration : spark.kubernetes.executor.enableApiWatcherAutoReset, so that add the ability to selectively disable executor watcher connection auto reset. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1401531745

   May I ask which K8s fabric kubernetes-client version are used, @yeachan153 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] simonvanderveldt commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "simonvanderveldt (via GitHub)" <gi...@apache.org>.
simonvanderveldt commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1478291017

   > To be safe, could you revisie this PR by adding a new internal configuration like KUBERNETES_EXECUTOR_ENABLE_API_WATCHER and use it with the following?
   
   Not sure this makes sense. AFAIK the current behavior is a bug, so it should just be fixed.
   If there's uncertainty about the implementation of this fix then IMHO this uncertainty should be addressed.
   For reference, the patch as it was before the config option was added is working fine for us, we run about 800 Spark apps a day.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1304497079

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yeachan153 commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "yeachan153 (via GitHub)" <gi...@apache.org>.
yeachan153 commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1401962554

   > May I ask which K8s fabric kubernetes-client version are used, @yeachan153 ?
   
   We are using 5.4.1 @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1403233118

   Thank you for confirming, @yeshengm . Since Apache Spark 3.4 is using `6.4.0`, it seems that we need some validations on `6.4.0`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] simonvanderveldt commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by GitBox <gi...@apache.org>.
simonvanderveldt commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1069112763


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##########
@@ -78,6 +80,15 @@ class ExecutorPodsWatchSnapshotSource(
     }
   }
 
+  def reset(watcher: ExecutorPodsWatcher): Unit = {

Review Comment:
   @sharkdtu What kind of exception would you expect? We're running this patch at the moment and can check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by GitBox <gi...@apache.org>.
holdenk commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1036510431


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##########
@@ -86,8 +97,14 @@ class ExecutorPodsWatchSnapshotSource(
     }
 
     override def onClose(e: WatcherException): Unit = {
-      logWarning("Kubernetes client has been closed (this is expected if the application is" +
-        " shutting down.)", e)
+      if (e != null && e.isHttpGone) {

Review Comment:
   I'm _slightly_ worried about an infinite loop here, I don't see any clear doc on when isHttpGone is set true in fabric8, can we have a max resets in K minutes counter or something similar?
   
   Or if there is good docs on the behaviour of isHttpGone that would be good too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1143613836


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##########
@@ -86,8 +97,14 @@ class ExecutorPodsWatchSnapshotSource(
     }
 
     override def onClose(e: WatcherException): Unit = {
-      logWarning("Kubernetes client has been closed (this is expected if the application is" +
-        " shutting down.)", e)
+      if (e != null && e.isHttpGone) {

Review Comment:
   I think we can accept the isHttpGone check as kubernetes-client does the exact same here:
   
   https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/impl/cache/Reflector.java#L288-L294



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Dam1029 commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by GitBox <gi...@apache.org>.
Dam1029 commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1312350979

   @dongjoon-hyun @Ngone51 Could you help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s
URL: https://github.com/apache/spark/pull/38518


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sharkdtu commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

Posted by GitBox <gi...@apache.org>.
sharkdtu commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1042926171


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##########
@@ -78,6 +80,15 @@ class ExecutorPodsWatchSnapshotSource(
     }
   }
 
+  def reset(watcher: ExecutorPodsWatcher): Unit = {

Review Comment:
   Have you ever test it?
   The problem is that kubernetes client has been closed. reset may throw an exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org