You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/07 19:20:47 UTC

[GitHub] [spark] attilapiros opened a new pull request #31513: [WIP][SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

attilapiros opened a new pull request #31513:
URL: https://github.com/apache/spark/pull/31513


   
   ### What changes were proposed in this pull request?
   
   This PR modifies the POD allocator to use the scheduler backend to get the known executors and remove those from the pending and newly created list.
   
   ### Why are the changes needed?
   
   Because there is race between executor POD allocator and cluster scheduler backend.
   Running several experiment during downscaling we experienced a lot of killed fresh executors wich has already running task on them.
   
   The pattern in the log was the following (see executor 312 and TID 2079):
   
   ```
   21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138)
   ...
   21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
   21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307).
   ...
   21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework.
   21/02/01 15:12:04 INFO TaskSetManager: Task 2079 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   #### Manually
   
   With this change there was no executor lost with running task on it.  
   
   ##### With unit test
   
   An existing test  is modified to check this case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #31513:
URL: https://github.com/apache/spark/pull/31513#discussion_r582672797



##########
File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
##########
@@ -164,15 +180,16 @@ private[spark] class ExecutorPodsAllocator(
       _deletedExecutorIds = _deletedExecutorIds.filter(existingExecs.contains)
     }
 
+    val notDeletedPods = lastSnapshot.executorPods.filterKeys(!_deletedExecutorIds.contains(_))

Review comment:
       This fixes an extra bug I found during testing. Without this change the 5.) point in my new test would fail as those pending pods which requested to be deleted earlier by the executor PODs allocator counted as available pending PODs for this resource profile (into the `currentPendingExecutorsForRpId`) and upscale is not triggered at:
   
   ```
   if (knownPodCount > targetNum) {
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-789645938


   Now I have it and updated the PR description of https://github.com/apache/spark/pull/31719 with my findings.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785778921


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40044/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780611933


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135198/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774756687


   **[Test build #134992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134992/testReport)** for PR 31513 at commit [`fbee783`](https://github.com/apache/spark/commit/fbee7834108c3c65cecac8f65d1159d736e9071a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775930495


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39653/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780575177


   **[Test build #135198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135198/testReport)** for PR 31513 at commit [`a316243`](https://github.com/apache/spark/commit/a316243c7b7c01b0cd95136b236daf30dd65becf).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785789070


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135464/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785826186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40044/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778101374


   **[Test build #135122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135122/testReport)** for PR 31513 at commit [`44392cd`](https://github.com/apache/spark/commit/44392cde7130c36e9898902b31591e79077e1bbe).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774768857






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785789070


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135464/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [WIP][SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774750499


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39575/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774768857






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778094454


   **[Test build #135122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135122/testReport)** for PR 31513 at commit [`44392cd`](https://github.com/apache/spark/commit/44392cde7130c36e9898902b31591e79077e1bbe).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785753915


   **[Test build #135464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135464/testReport)** for PR 31513 at commit [`bb578d2`](https://github.com/apache/spark/commit/bb578d296e77af20b4c3a006ac0349a84630fdad).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785826186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40044/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785248802


   I come up with a good test (but my work is still ongoing):
   ```
       // N-: newly created scheduler backend unknown
       // N+: newly created scheduler backend known
       // P- / P+ : pending
       // D: deleted
       //                                      |   default    ||         rp        |
       //                                      |              ||                   |
       //                                      | 1  | 2  | 3  || 4  | 5  | 6  | 7  |
       //--------------------------------------------------------------------------
       // 0) setTotalExpectedExecs with        | N- | N- | N- || N- | N- | N- | N- | so
       //      default->3, ro->4               |    |    |    ||    |    |    |    | outstanding == 7
       //---------------------------------------------------------------------------
       // 1) make 1 from each rp               | N+ | N- | N- || N+ | N- | N- | N- |
       //    known by backend                  |    |    |    ||    |    |    |    | outstanding == 5
       //---------------------------------------------------------------------------
       // 2) some more backend known + pending | N+ | P+ | P- || N+ | P+ | P- | N- | outstanding == 3
       //---------------------------------------------------------------------------
       // 3) advance time with idle timeout    |    |    |    ||    |    |    |    |
       //    setTotalExpectedExecs with        | N+ | P+ | D  || N+ | P+ | D  | D  | outstanding == 0
       //      default->1, rp->1               |    |    |    ||    |    |    |    |
       //---------------------------------------------------------------------------
       // 4) setTotalExpectedExecs with        | N+ | P+ | D  || N+ | P+ | D  | D  | outstanding == 0
       //      default->2, rp->2               |    |    |    ||    |    |    |    | no new POD req.
       //
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-789435672


   The PR is created: https://github.com/apache/spark/pull/31719


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-779821134


   cc @HeartSaVioR 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] asfgit closed pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #31513:
URL: https://github.com/apache/spark/pull/31513


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-776338478


   @dongjoon-hyun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774754704


   **[Test build #134992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134992/testReport)** for PR 31513 at commit [`fbee783`](https://github.com/apache/spark/commit/fbee7834108c3c65cecac8f65d1159d736e9071a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros edited a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros edited a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785248802


   I come up with a good test (but my work is still ongoing):
   ```
       // N-: newly created not known by the scheduler backend
       // N+: newly created scheduler backend known
       // P- / P+ : pending
       // D: deleted
       //                                      |   default    ||         rp        |
       //                                      |              ||                   |
       //                                      | 1  | 2  | 3  || 4  | 5  | 6  | 7  |
       //--------------------------------------------------------------------------
       // 0) setTotalExpectedExecs with        | N- | N- | N- || N- | N- | N- | N- | so
       //      default->3, ro->4               |    |    |    ||    |    |    |    | outstanding == 7
       //---------------------------------------------------------------------------
       // 1) make 1 from each rp               | N+ | N- | N- || N+ | N- | N- | N- |
       //    known by backend                  |    |    |    ||    |    |    |    | outstanding == 5
       //---------------------------------------------------------------------------
       // 2) some more backend known + pending | N+ | P+ | P- || N+ | P+ | P- | N- | outstanding == 3
       //---------------------------------------------------------------------------
       // 3) advance time with idle timeout    |    |    |    ||    |    |    |    |
       //    setTotalExpectedExecs with        | N+ | P+ | D  || N+ | P+ | D  | D  | outstanding == 0
       //      default->1, rp->1               |    |    |    ||    |    |    |    |
       //---------------------------------------------------------------------------
       // 4) setTotalExpectedExecs with        | N+ | P+ | D  || N+ | P+ | D  | D  | outstanding == 0
       //      default->2, rp->2               |    |    |    ||    |    |    |    | no new POD req.
       //
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778115629


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135122/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775876398


   **[Test build #135071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135071/testReport)** for PR 31513 at commit [`6c0c6e9`](https://github.com/apache/spark/commit/6c0c6e923a4d454ad47ad2975b6514a40d86539d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #31513:
URL: https://github.com/apache/spark/pull/31513#discussion_r582672797



##########
File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
##########
@@ -164,15 +180,16 @@ private[spark] class ExecutorPodsAllocator(
       _deletedExecutorIds = _deletedExecutorIds.filter(existingExecs.contains)
     }
 
+    val notDeletedPods = lastSnapshot.executorPods.filterKeys(!_deletedExecutorIds.contains(_))

Review comment:
       This fixes an extra bug I found during testing. Without this change the 5.) point in my new test would fail as those pending pods which requested to be deleted earlier by the executor PODs allocator counted as available pending PODs for this resource profile (into the `currentPendingExecutorsForRpId`) and upscale is not triggered at:
   
   ```
   if (knownPodCount > targetNum) {
   }
   ```
   
   As `knownPodCount` contains the `currentPendingExecutorsForRpId.size`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785762300


   **[Test build #135464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135464/testReport)** for PR 31513 at commit [`bb578d2`](https://github.com/apache/spark/commit/bb578d296e77af20b4c3a006ac0349a84630fdad).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774756932


   @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778143549


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39703/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-789621726


   @holdenk 
   
   As I found [a similar case](https://github.com/attilapiros/spark/blob/546d2eb5d46813a14c7bd30113fb6bb038cdd2fc/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala#L332-L335), expecting creating resource request in a specific order for multiple resource profiles,  in an earlier test `SPARK-33288: multiple resource profiles`:
   ``` 
       podsAllocatorUnderTest.setTotalExpectedExecutors(Map(defaultProfile -> 1, rp -> 2))
       verify(podOperations).create(podWithAttachedContainerForId(1, defaultProfile.id))
       verify(podOperations).create(podWithAttachedContainerForId(2, rp.id))
       verify(podOperations).create(podWithAttachedContainerForId(3, rp.id))
   ```
   
   And nobody was complaining for flakiness for this old test. I started to wonder if these cases can really cause any flakiness or not. If there is a Map implementation behind for these small Maps which uses fix ordering we are fine here.
    
   And I really bumped into the problem by having many code changes to make upscaling more progressive (a possible next PR). 
   
   But to be on the safe side better to use deterministic ordering [here](https://github.com/attilapiros/spark/blob/bb578d296e77af20b4c3a006ac0349a84630fdad/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L201-L202
   ): 
   ```
       // The order we request executors for each ResourceProfile is not guaranteed.
       totalExpectedExecutorsPerResourceProfileId.asScala.foreach { case (rpId, targetNum) =>
   ```
   
    WDYT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775898419


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135071/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785738931


   Rebased on top of the master and new test is added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-781427702


   @holdenk Even if they do not reach any known state they will be taken care by the `ExecutorPodsLifecycleManager`:
   https://github.com/apache/spark/blob/a316243c7b7c01b0cd95136b236daf30dd65becf/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L118-L147 
   
   I think `ExecutorPodsAllocator` responsibility are:
   1) allocating more resources for scaling up
   2) mitigating the risk of losing something useful during downscaling: by killing prefereby newly created or (if there is not enough newly created then the second best) pending requests.
   
   This PR improves on downscaling. As by checking the scheduler backend we have some more information than checking only the POD snapshot source. With this extra information those pending pods known by the scheduler backend **here** are treated equally with the running pods (as they might run a task). 
   
   Back to the eviction: so from `ExecutorPodsLifecycleManager` we know what will happen with the newly created request not reaching even the pending state. Let's see what will happen with a pending POD which never reaches the running state but known by the scheduler backend: they will die in the same way as any running POD. Either `spark.dynamicAllocation.executorIdleTimeout` is elapsed and the driver kills it (in case of DA) or finishes unexpectedly or if we are lucky it reaches the end of the application.
   
   Those cases cannot be tested here. 
   
   But I agree with you we need more tests: even for `onNewSnapshots` as it is getting more complex. 
   I will look into this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785753915


   **[Test build #135464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135464/testReport)** for PR 31513 at commit [`bb578d2`](https://github.com/apache/spark/commit/bb578d296e77af20b4c3a006ac0349a84630fdad).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780656362


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39779/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785796778


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40044/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-789430418


   Working on the upscaling I just realized there is a flaky part of this new test at [step 0](https://github.com/attilapiros/spark/blob/bb578d296e77af20b4c3a006ac0349a84630fdad/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala#L394-L400). As the order how the resource profiles are processed by the executor PODs allocator is not deterministic. 
   
   I fix this soon in a follow-up PR. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780583109


   **[Test build #135198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135198/testReport)** for PR 31513 at commit [`a316243`](https://github.com/apache/spark/commit/a316243c7b7c01b0cd95136b236daf30dd65becf).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #31513:
URL: https://github.com/apache/spark/pull/31513#discussion_r575101285



##########
File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
##########
@@ -272,7 +272,8 @@ private[spark] object Config extends Logging {
 
   val KUBERNETES_ALLOCATION_EXECUTOR_TIMEOUT =
     ConfigBuilder("spark.kubernetes.allocation.executor.timeout")
-      .doc("Time to wait before considering a pending executor timedout.")
+      .doc("Time to wait before a newly created executor POD request, which does not reached " +
+        "the POD pending state yet, considered timedout and will be deleted.")
       .version("3.1.0")

Review comment:
       This came up [in a comment of SPARK-34389](https://issues.apache.org/jira/browse/SPARK-34389?focusedCommentId=17282614&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17282614).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780656362


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39779/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-789341338


   Merged to the current dev & 3.1 branches.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778115629


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135122/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780575177


   **[Test build #135198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135198/testReport)** for PR 31513 at commit [`a316243`](https://github.com/apache/spark/commit/a316243c7b7c01b0cd95136b236daf30dd65becf).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780611933


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135198/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775881919


   **[Test build #135071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135071/testReport)** for PR 31513 at commit [`6c0c6e9`](https://github.com/apache/spark/commit/6c0c6e923a4d454ad47ad2975b6514a40d86539d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774755579


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39575/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jaceklaskowski commented on a change in pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
jaceklaskowski commented on a change in pull request #31513:
URL: https://github.com/apache/spark/pull/31513#discussion_r572216226



##########
File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
##########
@@ -114,8 +114,12 @@ private[spark] class ExecutorPodsAllocator(
 
   private def onNewSnapshots(
       applicationId: String,
+      schedulerBackend: KubernetesClusterSchedulerBackend,
       snapshots: Seq[ExecutorPodsSnapshot]): Unit = {
+    val schedulerKnownExecs = schedulerBackend.getExecutorIds().map(_.toLong).toSet
     newlyCreatedExecutors --= snapshots.flatMap(_.executorPods.keys)
+    val schedulerKnowNewlyCreatedExecs = newlyCreatedExecutors.keySet.intersect(schedulerKnownExecs)

Review comment:
       nit: a typo in `Know`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778094454


   **[Test build #135122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135122/testReport)** for PR 31513 at commit [`44392cd`](https://github.com/apache/spark/commit/44392cde7130c36e9898902b31591e79077e1bbe).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775898419


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135071/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-778143549


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39703/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775876398


   **[Test build #135071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135071/testReport)** for PR 31513 at commit [`6c0c6e9`](https://github.com/apache/spark/commit/6c0c6e923a4d454ad47ad2975b6514a40d86539d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780620713


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39779/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros edited a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
attilapiros edited a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-785738931


   Rebased on top of the master and new test is added.
   
   There is new commit e1434cd which changes the flag `hasPendingPods` to `numOutstandingPods` which used in asserts in all the existing tests, all tests passed in that commit:
   ```
   $ git checkout e1434cd
   ...
   $ NOLINT_ON_COMPILE=1 ./build/sbt -Pyarn -Phive -Pkubernetes -Pkubernetes-integration-test
   ...
   sbt:spark-kubernetes> testOnly *.ExecutorPodsAllocatorSuite
   [warn] There may be incompatibilities among your library dependencies; run 'evicted' to see detailed eviction warnings.
   [warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list
   [info] compiling 3 Scala sources to /Users/attilazsoltpiros/git/attilapiros/spark/resource-managers/kubernetes/core/target/scala-2.12/classes ...
   [info] compiling 3 Scala sources and 1 Java source to /Users/attilazsoltpiros/git/attilapiros/spark/core/target/scala-2.12/test-classes ...
   [info] compiling 2 Scala sources to /Users/attilazsoltpiros/git/attilapiros/spark/resource-managers/kubernetes/core/target/scala-2.12/test-classes ...
   [info] ExecutorPodsAllocatorSuite:
   [info] - Initially request executors in batches. Do not request another batch if the first has not finished. (65 milliseconds)
   [info] - Request executors in batches. Allow another batch to be requested if all pending executors start running. (40 milliseconds)
   [info] - When a current batch reaches error states immediately, re-request them on the next batch. (16 milliseconds)
   [info] - When an executor is requested but the API does not report it in a reasonable time, retry requesting that executor. (8 milliseconds)
   [info] - SPARK-28487: scale up and down on target executor count changes (34 milliseconds)
   [info] - SPARK-34334: correctly identify timed out pending pod requests as excess (9 milliseconds)
   [info] - SPARK-33099: Respect executor idle timeout configuration (11 milliseconds)
   [info] - SPARK-33288: multiple resource profiles (21 milliseconds)
   [info] - SPARK-33262: pod allocator does not stall with pending pods (8 milliseconds)
   [info] ScalaTest
   [info] Run completed in 2 seconds, 529 milliseconds.
   [info] Total number of tests run: 9
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780868330


   I think this looks reasonable, the pods would still be eligible for eviction once they transition into a known state and no longer have any tasks present on them, right? Would it be possible to extend the test to include that?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-775930495


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39653/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-780600313


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39779/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31513: [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31513:
URL: https://github.com/apache/spark/pull/31513#issuecomment-774754704


   **[Test build #134992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134992/testReport)** for PR 31513 at commit [`fbee783`](https://github.com/apache/spark/commit/fbee7834108c3c65cecac8f65d1159d736e9071a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org