You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by do...@apache.org on 2022/01/25 22:36:13 UTC
[spark] branch master updated: [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished
This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 9887d0f [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished
9887d0f is described below
commit 9887d0f7f55157da1b9f55d7053cc6c78ea3cdc5
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Jan 25 14:34:56 2022 -0800
[SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle `ExecutorDecommission` as finished
### What changes were proposed in this pull request?
Although SPARK-36614 (https://github.com/apache/spark/pull/33868) fixed the UI issue, it made a regression where the `K8s integration test` has been broken and shows a wrong metrics and message to the users. After `Finished decommissioning`, it's still counted it as `unfinished`. This PR aims to fix this bug.
**BEFORE**
```
22/01/25 13:05:16 DEBUG KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:
Asked to remove executor 1 with reason Finished decommissioning
...
22/01/25 13:05:16 INFO ExecutorMonitor: Executor 1 is removed.
Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 1, driver killed: 0, unexpectedly exited: 0).
```
**AFTER**
```
Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
```
### Why are the changes needed?
```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
```
**BEFORE**
The corresponding test case hangs and fails.
```
[info] KubernetesSuite:
...
[info] *** Test still running after 2 minutes, 13 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
// Eventually fails
...
```
**AFTER**
```
[info] KubernetesSuite:
...
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
...
```
### Does this PR introduce _any_ user-facing change?
Yes, this is a regression bug fix.
### How was this patch tested?
Manually because this should be verified via the K8s integration test
```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
```
Closes #35321 from dongjoon-hyun/SPARK-38023.
Authored-by: Dongjoon Hyun <do...@apache.org>
Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
.../scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala | 3 ++-
.../apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
index 3dea64c..def63b9 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
@@ -356,7 +356,8 @@ private[spark] class ExecutorMonitor(
if (removed != null) {
decrementExecResourceProfileCount(removed.resourceProfileId)
if (removed.decommissioning) {
- if (event.reason == ExecutorLossMessage.decommissionFinished) {
+ if (event.reason == ExecutorLossMessage.decommissionFinished ||
+ event.reason == ExecutorDecommission().message) {
metrics.gracefullyDecommissioned.inc()
} else {
metrics.decommissionUnfinished.inc()
diff --git a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
index 9605f6c..ca6108d 100644
--- a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
+++ b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
@@ -151,7 +151,7 @@ private[spark] trait DecommissionSuite { k8sSuite: KubernetesSuite =>
val client = kubernetesTestComponents.kubernetesClient
// The label will be added eventually, but k8s objects don't refresh.
Eventually.eventually(
- PatienceConfiguration.Timeout(Span(1200, Seconds)),
+ PatienceConfiguration.Timeout(Span(120, Seconds)),
PatienceConfiguration.Interval(Span(1, Seconds))) {
val currentPod = client.pods().withName(pod.getMetadata.getName).get
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org