You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/29 05:24:23 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

warrenzhu25 opened a new pull request, #39280:
URL: https://github.com/apache/spark/pull/39280

   ### What changes were proposed in this pull request?
   Handle decommission request sent before executor registration
   
   ### Why are the changes needed?
   Current behavior is such requests will be ignored
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added test in CoarseGrainedSchedulerBackendSuite
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1516819804

   @dongjoon-hyun Could we merge this? It seems no response from @Ngone51 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1530390092

   Sorry, @warrenzhu25 . I was on vacation. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1071230700


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -534,11 +549,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     // Do not change this code without running the K8s integration suites
     val executorsToDecommission = executorsAndDecomInfo.flatMap { case (executorId, decomInfo) =>
       // Only bother decommissioning executors which are alive.
+      // Keep executor decommission info in case executor started, but not registered yet
       if (isExecutorActive(executorId)) {
         scheduler.executorDecommission(executorId, decomInfo)
         executorsPendingDecommission(executorId) = decomInfo
         Some(executorId)
       } else {
+        unKnownExecutorsPendingDecommission.put(executorId,

Review Comment:
   My concern is actually the race condition between "executor killed by driver" and "executor decommissioned". If "executor killed by driver" happens right before "executor decommissioned", then we'd mistakenly put the executor into `unKnownExecutorsPendingDecommission`. But yes, the cache is bounded as well as this only a race condition, so I think it should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1064192028


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +104,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unKnownExecutorsPendingDecommission =
+    CacheBuilder.newBuilder()
+      .maximumSize(1000)

Review Comment:
   In an cluster with spot VM, the number of such executors could be large. What's your suggestion for this? Making this smaller or making it configurable? I plan to keep them at most 5 mins.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1530386238

   @mridulm Could you help merge this. It seems no response from @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1367694180

   > > > May I ask what is the benefit of handling an executor who did nothing yet, @warrenzhu25 ?
   > > > > Handle decommission request sent before executor registration
   > > 
   > > 
   > > When decom request is triggered from worker, executor should be decommissioned and not accept new tasks, but if scheduler backend got decom request before executor registration, the executor won't be decommissioned.
   > 
   > Could you elaborate about the real situation you hit? If the spot instances or K8s knodes are in the process of terminations, it seems that we are going to lose those underlying machines anyway, doesn't it?
   
   Image the below senario:
   1. Executor 1 launched but not resigstered.
   2. Node got decom signal, request to driver to decom executor 1. Node will terminate in 1 min.
   3. If decom request is ignored by driver, driver will schedule new tasks on executor 1. After the fix, no new task will be scheduled after executor registration.
   4. 1 mins later, all running tasks on executor 1 will be lost.
   
   In our senario, node will be terminated only after all executor are exit. When decom request is missed, executor won't exit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
mridulm commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1064086546


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -298,6 +309,10 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
           listenerBus.post(
             SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
           // Note: some tests expect the reply to come after we put the executor in the map
+          // Decommission executor whose request received before registration
+          Option(unKnownExecutorsPendingDecommission.getIfPresent(executorId))
+            .foreach(v => decommissionExecutors(Array((executorId, v._1)), v._2, v._3))

Review Comment:
   Also, invalidate `executorId` from the cache in the `foreach`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1368223726

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1499343023

   Sorry for being away, @warrenzhu25 and @mridulm . I'll take a look again Today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1530392797

   Thank you, @warrenzhu25 , @mridulm , @Ngone51 . 
   Merged to master for Apache Spark 3.5.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1160035944


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +103,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unknownExecutorsPendingDecommission =
+    CacheBuilder.newBuilder()
+      .maximumSize(1000)

Review Comment:
   Please add an official configuration for this magic number. Can we start with `0` safely (by disabling this features), @warrenzhu25 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1160945224


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2242,6 +2242,16 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(0)
 
+  private[spark] val SCHEDULER_MAX_RETAINED_UNKNOWN_EXECUTORS =
+    ConfigBuilder("spark.scheduler.maxRetainedUnknownDecommissionExecutors")
+      .internal()
+      .doc("Max number of unknown executors by decommission to retain. This affects " +
+        "whether executor could receive decommission request sent before its registration.")
+      .version("3.4.0")

Review Comment:
   `3.5.0`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1160949314


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2242,6 +2242,16 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(0)
 
+  private[spark] val SCHEDULER_MAX_RETAINED_UNKNOWN_EXECUTORS =
+    ConfigBuilder("spark.scheduler.maxRetainedUnknownDecommissionExecutors")
+      .internal()
+      .doc("Max number of unknown executors by decommission to retain. This affects " +
+        "whether executor could receive decommission request sent before its registration.")
+      .version("3.4.0")

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1499242189

   @dongjoon-hyun @mridulm @Ngone51 Help take a look again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1367594432

   > May I ask what is the benefit of handling an executor who did nothing yet, @warrenzhu25 ?
   > 
   > > Handle decommission request sent before executor registration
   
   When decom request is triggered from worker, executor should be decommissioned and not accept new tasks, but if scheduler backend got decom request before executor registration, the executor won't be decommissioned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1367081016

   @dongjoon-hyun @mridulm @Ngone51 Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1059223633


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -53,6 +53,8 @@ private[spark]
 class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
   extends ExecutorAllocationClient with SchedulerBackend with Logging {
 
+  import com.google.common.cache.CacheBuilder

Review Comment:
   Please move this to the `import` header part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1367687995

   > > May I ask what is the benefit of handling an executor who did nothing yet, @warrenzhu25 ?
   > > > Handle decommission request sent before executor registration
   > 
   > When decom request is triggered from worker, executor should be decommissioned and not accept new tasks, but if scheduler backend got decom request before executor registration, the executor won't be decommissioned.
   
   Could you elaborate about the real situation you hit? If the spot instances or K8s knodes are in the process of terminations, it seems that we are going to lose those underlying machines anyway, doesn't it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1059223400


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +104,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unKnownExecutorsPendingDecommission =
+    CacheBuilder.newBuilder()
+      .maximumSize(1000)

Review Comment:
   This looks like a random magic number. Do you mean that one `CoarseGrainedSchedulerBackend` can have many un-registered executors like `1000`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1059223443


##########
core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala:
##########
@@ -403,6 +404,30 @@ class CoarseGrainedSchedulerBackendSuite extends SparkFunSuite with LocalSparkCo
       "Our unexpected executor does not have a request time.")
   }
 
+  test("New registered executor should receive decommission request sent before registration") {
+

Review Comment:
   Redundant empty line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1499306450

   The PR looks reasonable to me - but I would like either @dongjoon-hyun or @Ngone51 to also take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration
URL: https://github.com/apache/spark/pull/39280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
mridulm commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1064086141


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -534,11 +549,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     // Do not change this code without running the K8s integration suites
     val executorsToDecommission = executorsAndDecomInfo.flatMap { case (executorId, decomInfo) =>
       // Only bother decommissioning executors which are alive.
+      // Keep executor decommission info in case executor started, but not registered yet
       if (isExecutorActive(executorId)) {
         scheduler.executorDecommission(executorId, decomInfo)
         executorsPendingDecommission(executorId) = decomInfo
         Some(executorId)
       } else {
+        unKnownExecutorsPendingDecommission.put(executorId,

Review Comment:
   Is the concern that the entry in the map will be around for a registration request to come in, that never happens ?
   If yes, this can also happen when the executor startup fails - for example.
   
   This should be fine, given the cache is bounded, right ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1059223050


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +104,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unKnownExecutorsPendingDecommission =

Review Comment:
   `unKnown` -> `unknown`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1060301018


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -534,11 +549,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     // Do not change this code without running the K8s integration suites
     val executorsToDecommission = executorsAndDecomInfo.flatMap { case (executorId, decomInfo) =>
       // Only bother decommissioning executors which are alive.
+      // Keep executor decommission info in case executor started, but not registered yet
       if (isExecutorActive(executorId)) {
         scheduler.executorDecommission(executorId, decomInfo)
         executorsPendingDecommission(executorId) = decomInfo
         Some(executorId)
       } else {
+        unKnownExecutorsPendingDecommission.put(executorId,

Review Comment:
   What if the executor is actually killed by the driver right before the decom request comes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1160035944


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +103,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unknownExecutorsPendingDecommission =
+    CacheBuilder.newBuilder()
+      .maximumSize(1000)

Review Comment:
   Please add an official configuration for this magic number and can we start with `0` safely (by disabling this features), @warrenzhu25 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #39280:
URL: https://github.com/apache/spark/pull/39280#issuecomment-1500617216

   Gentle ping @Ngone51 once more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on code in PR #39280:
URL: https://github.com/apache/spark/pull/39280#discussion_r1160765918


##########
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##########
@@ -102,6 +103,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors which are being decommissioned. Maps from executorId to ExecutorDecommissionInfo.
   protected val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
 
+  // Unknown Executors which are being decommissioned. This could be caused by unregistered executor
+  // This executor should be decommissioned after registration.
+  // Maps from executorId to (ExecutorDecommissionInfo, adjustTargetNumExecutors,
+  // triggeredByExecutor).
+  protected val unknownExecutorsPendingDecommission =
+    CacheBuilder.newBuilder()
+      .maximumSize(1000)

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org