Posted to commits@spark.apache.org by do...@apache.org on 2022/01/25 22:36:13 UTC

[spark] branch master updated: [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 9887d0f  [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished
9887d0f is described below

commit 9887d0f7f55157da1b9f55d7053cc6c78ea3cdc5
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Jan 25 14:34:56 2022 -0800

    [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished
    
    ### What changes were proposed in this pull request?
    
    Although SPARK-36614 (https://github.com/apache/spark/pull/33868) fixed the UI issue, it made a regression where the `K8s integration test` has been broken and shows a wrong metrics and message to the users. After `Finished decommissioning`, it's still counted it as `unfinished`. This PR aims to fix this bug.
    
    **BEFORE**
    ```
    22/01/25 13:05:16 DEBUG KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:
    Asked to remove executor 1 with reason Finished decommissioning
    ...
    22/01/25 13:05:16 INFO ExecutorMonitor: Executor 1 is removed.
    Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 1, driver killed: 0, unexpectedly exited: 0).
    ```
    
    **AFTER**
    ```
    Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
    ```
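
    Below is a minimal, self-contained sketch (illustration only, not the actual `ExecutorMonitor` code; the object and method names are invented) of the counting logic this change fixes. The two message values are placeholders standing in for `ExecutorLossMessage.decommissionFinished` and `ExecutorDecommission().message`, which the real code treats as distinct strings that should both count as a graceful decommission.

    ```scala
    // Illustration only: placeholders stand in for the real message constants.
    object RemoveReasonSketch {
      val decommissionFinished = "placeholder: decommissionFinished message"
      val executorDecommissionMsg = "placeholder: ExecutorDecommission().message"

      var gracefullyDecommissioned = 0
      var decommissionUnfinished = 0

      // Before this commit only `decommissionFinished` was accepted, so a removal
      // reported with `executorDecommissionMsg` was miscounted as unfinished.
      def onDecommissionedExecutorRemoved(reason: String): Unit = {
        if (reason == decommissionFinished || reason == executorDecommissionMsg) {
          gracefullyDecommissioned += 1
        } else {
          decommissionUnfinished += 1
        }
      }

      def main(args: Array[String]): Unit = {
        onDecommissionedExecutorRemoved(executorDecommissionMsg)
        // Prints: gracefully decommissioned: 1, decommission unfinished: 0
        println(s"gracefully decommissioned: $gracefullyDecommissioned, " +
          s"decommission unfinished: $decommissionUnfinished")
      }
    }
    ```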
    
    ### Why are the changes needed?
    
    ```
    $ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
    ```
    
    **BEFORE**
    The corresponding test case hangs and fails.
    ```
    [info] KubernetesSuite:
    ...
    [info] *** Test still running after 2 minutes, 13 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
    // Eventually fails
    ...
    ```
    
    **AFTER**
    ```
    [info] KubernetesSuite:
    ...
    [info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
    ...
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this is a regression bug fix.
    
    ### How was this patch tested?
    
    Manually, because this should be verified via the K8s integration test:
    ```
    $ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
    ```
    
    Closes #35321 from dongjoon-hyun/SPARK-38023.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 .../scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala    | 3 ++-
 .../apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala    | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
index 3dea64c..def63b9 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
@@ -356,7 +356,8 @@ private[spark] class ExecutorMonitor(
     if (removed != null) {
       decrementExecResourceProfileCount(removed.resourceProfileId)
       if (removed.decommissioning) {
-        if (event.reason == ExecutorLossMessage.decommissionFinished) {
+        if (event.reason == ExecutorLossMessage.decommissionFinished ||
+            event.reason == ExecutorDecommission().message) {
           metrics.gracefullyDecommissioned.inc()
         } else {
           metrics.decommissionUnfinished.inc()
diff --git a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
index 9605f6c..ca6108d 100644
--- a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
+++ b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
@@ -151,7 +151,7 @@ private[spark] trait DecommissionSuite { k8sSuite: KubernetesSuite =>
           val client = kubernetesTestComponents.kubernetesClient
           // The label will be added eventually, but k8s objects don't refresh.
           Eventually.eventually(
-            PatienceConfiguration.Timeout(Span(1200, Seconds)),
+            PatienceConfiguration.Timeout(Span(120, Seconds)),
             PatienceConfiguration.Interval(Span(1, Seconds))) {
 
             val currentPod = client.pods().withName(pod.getMetadata.getName).get

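As a side note on the `DecommissionSuite` change above: the test relies on ScalaTest's `Eventually` with an explicit patience configuration, and this commit tightens the timeout from 1200 to 120 seconds. The following standalone sketch (an illustration under assumed conditions, not code from the Spark test) shows how that `eventually(Timeout, Interval) { ... }` pattern retries a block until it succeeds or the timeout elapses.

```scala
// Standalone sketch of the ScalaTest Eventually pattern used in
// DecommissionSuite; the retried condition here is made up for illustration.
import org.scalatest.concurrent.{Eventually, PatienceConfiguration}
import org.scalatest.time.{Seconds, Span}

object EventuallySketch {
  def main(args: Array[String]): Unit = {
    var attempts = 0
    // Retry the block every 1 second, for at most 120 seconds, until it
    // completes without throwing.
    Eventually.eventually(
        PatienceConfiguration.Timeout(Span(120, Seconds)),
        PatienceConfiguration.Interval(Span(1, Seconds))) {
      attempts += 1
      if (attempts < 3) {
        throw new RuntimeException(s"condition not met yet (attempt $attempts)")
      }
    }
    println(s"condition met after $attempts attempts")
  }
}
```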
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org