Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/21 09:29:54 UTC

[GitHub] [spark] seayoun opened a new pull request #26975: Stage retry and executor crash cause app hung up forever

seayoun opened a new pull request #26975: Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975
 
 
   ### **What changes were proposed in this pull request?**
   Fix a bug that can cause the application to hang.
   The analysis and discussion of the buggy code are here:
   https://issues.apache.org/jira/browse/SPARK-30325
   The bug occurs in the following corner case:
   1. A stage hits a fetch failure while some of its tasks haven't finished; the scheduler resubmits a new stage attempt as a retry, containing those unfinished tasks.
   2. An unfinished task in the original stage attempt finishes while the same task in the new retry attempt hasn't finished; the task's partition is then marked as successful in the new retry attempt.
   3. The executor running those 'successful' tasks crashes, which causes the TaskSetManager to run executorLost to reschedule the tasks on that executor. This decrements copiesRunning twice: because those 'successful' tasks have not actually finished, copiesRunning ends up at -1.
   4. 'dequeueTaskFromList' uses copiesRunning == 0 as the condition for rescheduling a task; since the value is now -1, the task cannot be rescheduled and the app hangs forever.
   
   The fix: once a task has succeeded in the original stage attempt, kill the copy of the same task that the new retry attempt has started but not yet finished.
   This can also reduce stage run time and resource cost.
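
The four steps above can be reproduced with a small model. The following is a minimal sketch, not Spark's actual `TaskSetManager`: the class `MiniTaskSet`, the `Attempt` record, and the executor name `exec-1` are hypothetical, and only the two counters relevant to the bug are modeled.

```scala
import scala.collection.mutable

// Simplified, hypothetical model of the bookkeeping described above; not Spark's code.
object CopiesRunningDemo {
  final case class Attempt(tid: Long, index: Int, executorId: String, var running: Boolean)

  final class MiniTaskSet(numPartitions: Int) {
    val successful: Array[Boolean] = Array.fill(numPartitions)(false)
    val copiesRunning: Array[Int] = Array.fill(numPartitions)(0)
    val attempts = mutable.LinkedHashMap.empty[Long, Attempt]

    def launch(tid: Long, index: Int, executorId: String): Unit = {
      attempts(tid) = Attempt(tid, index, executorId, running = true)
      copiesRunning(index) += 1
    }

    // Step 2: another stage attempt finished this partition; our copy keeps running.
    def markPartitionCompleted(index: Int): Unit = {
      successful(index) = true
    }

    // Step 3: both loops fire for an attempt that is "successful" and still running.
    def executorLost(executorId: String): Unit = {
      for (a <- attempts.values if a.executorId == executorId) {
        if (successful(a.index)) {
          // Treated as finished output that was lost with the executor.
          successful(a.index) = false
          copiesRunning(a.index) -= 1
        }
      }
      for (a <- attempts.values if a.running && a.executorId == executorId) {
        // Treated as a running task that failed with the executor.
        a.running = false
        copiesRunning(a.index) -= 1
      }
    }

    // Step 4: a partition is eligible only when no copy is believed to be running.
    def canReschedule(index: Int): Boolean =
      copiesRunning(index) == 0 && !successful(index)
  }

  def main(args: Array[String]): Unit = {
    val tsm = new MiniTaskSet(numPartitions = 1)
    tsm.launch(tid = 0L, index = 0, executorId = "exec-1")
    tsm.markPartitionCompleted(0)
    tsm.executorLost("exec-1")
    // prints "copiesRunning(0) = -1, canReschedule = false"
    println(s"copiesRunning(0) = ${tsm.copiesRunning(0)}, canReschedule = ${tsm.canReschedule(0)}")
  }
}
```

In this model the partition is neither successful nor running after the executor is lost, yet it can never be picked up again because the counter is stuck below zero.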
   
   ### **Why are the changes needed?**
   The bug causes the application to hang.
   
   ### **Does this PR introduce any user-facing change?**
   No
   
   ### **How was this patch tested?**

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888423
 
 
   ok to test



[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug, and it's caused by our adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same stage. Unfortunately, the missing part is that we didn't try to kill the running task attempts when we mark the partitions as completed, so when the running task attempts fail with ExecutorLost, the completed partition result is reverted (which is not necessary in this case).
   
   To me the best solution here would be to kill all the running task attempts for the completed partition in the TSM inside the method `markPartitionCompleted`; this would resolve the issue without any side effects.
   
   Also cc @squito @cloud-fan @Ngone51 
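
The suggestion above can be sketched as follows. This is a simplified model under stated assumptions, not the change made in the PR: `MiniTaskSet`, `Attempt`, and the `killTask` callback are hypothetical stand-ins, and the kill is only requested here, since a real kill completes asynchronously.

```scala
import scala.collection.mutable

// Hypothetical, simplified model; names and structure are illustrative, not Spark's code.
object KillOnCompletionSketch {
  final case class Attempt(tid: Long, index: Int, var running: Boolean)

  // `killTask` stands in for whatever backend call stops an attempt (assumption).
  final class MiniTaskSet(numPartitions: Int, killTask: Long => Unit) {
    val successful: Array[Boolean] = Array.fill(numPartitions)(false)
    val attempts = mutable.LinkedHashMap.empty[Long, Attempt]

    def launch(tid: Long, index: Int): Unit = {
      attempts(tid) = Attempt(tid, index, running = true)
    }

    // Called when an attempt in another TSM of the same stage finished this partition.
    def markPartitionCompleted(index: Int): Unit = {
      if (!successful(index)) {
        successful(index) = true
        // Request a kill for every still-running local attempt of this partition.
        for (a <- attempts.values if a.index == index && a.running) killTask(a.tid)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val killed = mutable.ArrayBuffer.empty[Long]
    val tsm = new MiniTaskSet(2, tid => { killed += tid; () })
    tsm.launch(tid = 10L, index = 0)
    tsm.launch(tid = 11L, index = 1)
    tsm.markPartitionCompleted(0)
    println(s"kill requested for: ${killed.mkString(", ")}")
  }
}
```

Only the attempt for the completed partition receives a kill request; the attempt for the other partition keeps running.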



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569570285
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] itskals commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
itskals commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#discussion_r360781112
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -100,6 +100,11 @@ private[spark] class TaskSetManager(
   // should not resubmit while executor lost.
   private val killedByOtherAttempt = new HashSet[Long]
 
+  // Add the tid of task into this HashSet when the task is killed by other stage retries.
+  // For example, if stage failed and retry, when the task in the origin stage finish, it will
+  // kill the new stage task running the same partition data
+  private val killedByOtherStageRetries = new HashSet[Long]
 
 Review comment:
   Why is this HashSet not looked up when handling `executorLost`?



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568914908
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug, and it's caused by our adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same stage. Unfortunately, the missing part is that we didn't try to kill the running task attempts when we mark the partitions as completed, so when the running task attempts fail with ExecutorLost, the completed partition result is reverted (which is not necessary in this case).
   
   To me the best solution here would be to kill all the running task attempts in the TSM inside the method `markPartitionCompleted`; this would resolve the issue without any side effects.
   
   Also cc @squito @cloud-fan @Ngone51 



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888933
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590023
 
 
   Note that a task roughly has 2 states: running and finished. A partition has 2 states as well: successful or not.
   
   `TaskSetManager.successful` tracks the successfulness of each partition, and is kind of synced among all TSMs, via `TaskSetManager.markPartitionCompleted`.
   
   It's obviously a bug that `TaskSetManager.executorLost` checks the status of task and partition separately. A task may satisfy both conditions and be handled twice.
   
   For case 1: I think it's OK, as the new task will override the map status (the new task has a bigger epoch). It's a waste of resources, but it's better than a hang.
   
   For case 2: I don't think it can happen. If T1 finished first, the partition in TSM2 will be marked as successful too.



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568676054
 
 
   @cloud-fan The task status is indeed inconsistent; however, what we need to avoid is `Resubmitted`, not `handleFailedTask`.
   > change task.running to !successful(task.index) && task.running in executorLost
   
   This change will cause `Resubmitted` and reschedule the task again; `handleFailedTask` won't reschedule it.



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361573015
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,10 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        // We may have a running task whose partition has been marked as successful,
+        // because this partition has another task in another stage attempt.
 
 Review comment:
   `this partition has another task completed in another stage attempt.`



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574031555
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116679/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573787615
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569853896
 
 
   > Let me expand on case 2:
   > If T1 finishes first, the partition in TSM2 (denoted P1) will be marked as successful too. Then the executor gets lost; since T2 is still running, we won't change `successful(P1)` to false.
   > Then other partitions in TSM2 could be marked as successful by other tasks, so TSM2 thinks all the partitions have finished, but actually P1 has been lost and is never computed again.
   
   
   There are two further cases in this situation:
   
   1. T1 and T2 run on different executors: it doesn't matter.
   2. T1 and T2 run on the same executor: T2 will not be retried since T1 has succeeded.
   Consider this situation:
    A stage has finished and then an executor holding the stage's shuffle files is lost. We can't reschedule the tasks since the stage has finished; the retry happens when the next stage gets a `FetchFailedException`.
   The case we discussed is like this one: we won't reschedule the task in the finished TSM. I think it is similar.
   
   So I think this is reasonable. What do you think?



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888938
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20558/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903992
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573787623
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116652/
   Test FAILed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568676054
 
 
   @cloud-fan The task status is indeed inconsistent; however, what we need to avoid is `Resubmitted`, not `handleFailedTask`.
   > change task.running to !successful(task.index) && task.running in executorLost
   
   This change will cause `Resubmitted` and reschedule the task again; `handleFailedTask` won't reschedule it. WDYT?



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-571408541
 
 
   @jiangxb1987 @Ngone51 @cloud-fan



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569853896
 
 
   > Let me expand on case 2:
   > If T1 finishes first, the partition in TSM2 (denoted P1) will be marked as successful too. Then the executor gets lost; since T2 is still running, we won't change `successful(P1)` to false.
   > Then other partitions in TSM2 could be marked as successful by other tasks, so TSM2 thinks all the partitions have finished, but actually P1 has been lost and is never computed again.
   
   
   There are two further cases in this situation:
   
   1. T1 and T2 run on different executors: it doesn't matter.
   2. T1 and T2 run on the same executor: T2 will not be retried since T1 has succeeded.
   Consider this situation:
    A stage has finished and then an executor holding the stage's shuffle files is lost. We can't reschedule the tasks since the stage has finished; the retry happens when the next stage gets a `FetchFailedException`.
   The case we discussed is like this one: **we won't reschedule the task in the finished TSM when the executor is lost**. I think it is similar.
   
   So I think this is reasonable. What do you think?



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568672632
 
 
   I think the direct problem is, the task status is inconsistent: a task can satisfy both `successful(task.index)` and `task.running`. Such tasks will be handled twice in `executorLost` and mess up the internal status.
   
   A simple fix can be: change `task.running` to `!successful(task.index) && task.running` in `executorLost`
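
To make the "handled twice" point concrete, here is a small decision table for a single task attempt on the lost executor. It is a sketch under assumptions, not Spark's code: the object and method names are hypothetical, both conditions are evaluated against the same snapshot of state, and other checks such as `killedByOtherAttempt` are left out.

```scala
// Simplified decision table for one task attempt on the lost executor.
// Assumption: both conditions see the same snapshot of (successful, running).
object ExecutorLostDecision {
  // Handlers that fire when the partition and task checks are made separately.
  def original(successful: Boolean, running: Boolean): List[String] =
    (if (successful) List("Resubmitted") else Nil) ++
      (if (running) List("handleFailedTask") else Nil)

  // The suggested guard: count an attempt as running only if its partition is not successful.
  def withGuard(successful: Boolean, running: Boolean): List[String] =
    (if (successful) List("Resubmitted") else Nil) ++
      (if (!successful && running) List("handleFailedTask") else Nil)

  def main(args: Array[String]): Unit = {
    for (s <- List(true, false); r <- List(true, false)) {
      println(s"successful=$s running=$r original=${original(s, r)} withGuard=${withGuard(s, r)}")
    }
  }
}
```

For an attempt that is both successful and running, the separate checks fire both handlers, while the guarded version fires only one. In this model the handler that remains is `Resubmitted`.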



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568343977
 
 
   cc @dongjoon-hyun @wangshuo128 @squito 



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568914911
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115773/
   Test PASSed.



[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569832087
 
 
   Let me expand on case 2:
   If T1 finishes first, the partition in TSM2 (denoted P1) will be marked as successful too. Then the executor gets lost; since T2 is still running, we won't change `successful(P1)` to false.
   Then other partitions in TSM2 could be marked as successful by other tasks, so TSM2 thinks all the partitions have finished, but actually P1 has been lost and is never computed again.
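
The scenario in case 2 can be traced with a toy model. This is a sketch under assumptions, not Spark's code: `MiniTaskSet`, the `outputs` map, and the executor names are hypothetical, partition 0 plays the role of P1, and T1 and T2 are assumed to run on the same executor.

```scala
import scala.collection.mutable

// Hypothetical model of case 2: TSM2 believes it is done although one output is gone.
object LostOutputSketch {
  final class MiniTaskSet(numPartitions: Int) {
    val successful: Array[Boolean] = Array.fill(numPartitions)(false)
    def markPartitionCompleted(p: Int): Unit = { successful(p) = true }
    def allPartitionsSuccessful: Boolean = successful.forall(identity)
  }

  // Returns (TSM2 believes all partitions are done, partitions whose output still exists).
  def run(): (Boolean, Set[Int]) = {
    val outputs = mutable.Map.empty[Int, String] // partition -> executor holding its output
    val tsm2 = new MiniTaskSet(2)

    // T1 (from the earlier attempt) finishes partition 0 on exec-1; TSM2 is told as well.
    outputs(0) = "exec-1"
    tsm2.markPartitionCompleted(0)

    // exec-1 is lost while T2 (TSM2's copy of partition 0) is still running there.
    // Under the rule discussed above, the running attempt does not flip successful(0) back.
    val lost = outputs.filter(_._2 == "exec-1").keys.toList
    outputs --= lost

    // Another task finishes partition 1 on a healthy executor.
    outputs(1) = "exec-2"
    tsm2.markPartitionCompleted(1)

    (tsm2.allPartitionsSuccessful, outputs.keySet.toSet)
  }

  def main(args: Array[String]): Unit = {
    val (done, available) = run()
    // prints "TSM2 done = true, available outputs = Set(1)"
    println(s"TSM2 done = $done, available outputs = $available")
  }
}
```

In this model the task set reports completion while the output for partition 0 no longer exists anywhere, which is the inconsistency the comment describes.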



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568167085
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] Ngone51 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569375217
 
 
   For case 1, if T2 successfully completes after T1, will T2 override T1's `MapStatus`?
   
   For case 2, I don't understand yet:
   
   > since T2 is still running the partition will not be marked as not successful.
   
   So, the partition has been marked as successful by T1?
   
   > After a while maybe another task finished and mark the TSM as finished
   
   Assuming a task T0 from M0 finishes, which TSM does it mark as finished: TSM1, TSM2, or both?
   
   
   



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568890537
 
 
   > LGTM. Can you update the PR description?
   
   @cloud-fan thanks for your review!



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731208
 
 
   **[Test build #116652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116652/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568914911
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115773/
   Test PASSed.



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569853896
 
 
   > Let me expand on case 2:
   > If T1 finished first, the partition in TSM2 (notated as P1) will be marked as successful too. Then the executor get lost, since T2 is still running, we won't change `successful(P1)` to false.
   > Then, possibly other partitions in TSM2 could be marked as successful by other tasks, then TSM2 think all the partitions has been finished, but actually P1 has been lost and not computed again.
   
   
   There are two further cases in this situation:
   
   1. T1 and T2 run on different executors: it doesn't matter.
   2. T1 and T2 run on the same executor: T2 will not be retried since T1 has succeeded.
   
   Consider this situation:
   A stage has finished, and then an executor holding the stage's shuffle files gets lost. We can't reschedule its tasks since the stage has finished; instead, the retry happens when the next stage gets a `FetchFailedException`.
   The case we discussed is like this one: **we won't reschedule the task in the finished TSM when the executor gets lost**. I think it is similar.
   
   So I think this is reasonable. What do you think?



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574078951
 
 
   thanks, merging to master!



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888933
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731943
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903381
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] seayoun commented on a change in pull request #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#discussion_r361074152
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -698,9 +704,13 @@ private[spark] class TaskSetManager(
         totalResultSize -= resultSizeAcc.get.asInstanceOf[LongAccumulator].value
       }
 
+      val killedReason = if (killedByOtherAttempt.contains(tid)) {
+        TaskKilled("Finish but did not commit due to another attempt succeeded")
+      } else {
+        TaskKilled("Finish but did not commit due to task in another stage retry succeeded")
 
 Review comment:
   `killedByOtherStageRetries` is also used in `executorLost` to avoid `Resubmitted`.



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361572508
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,10 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        // We may have a running task whose partition has been marked as successful,
+        // because this partition has another task in another stage attempt.
+        // We will `handleFailedTask` at next if the partition's task hasn't finished.
 
 Review comment:
   `we will call`



[GitHub] [spark] jiangxb1987 removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365197
 
 
   There are multiple corner cases not handled by the current solution.
   Imagine we have two TSMs (M1 and M2) working on the same stage, and denote their tasks for a specific partition as T1 and T2:
   1. T1 and T2 might be scheduled on different executors (E1 and E2). T1 has finished but T2 is still running. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is actually not necessary because the shuffle files are on E1.
   2. T1 and T2 might be scheduled on the same executor. T1 has finished but T2 is still running. Then the executor gets lost; since T2 is still running, the partition will not be marked as not successful. After a while another task may finish and mark the TSM as finished, but the shuffle files have actually been lost, so this leads to a new regression.
   
   I don't have a solution yet. I'm thinking about whether we can put the successful task information into `taskInfos` inside `markPartitionCompleted`; if this is possible, the second problem mentioned above could probably be resolved.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573729603
 
 
   retest this please



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573787615
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] seayoun commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#discussion_r360880185
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -100,6 +100,11 @@ private[spark] class TaskSetManager(
   // should not resubmit while executor lost.
   private val killedByOtherAttempt = new HashSet[Long]
 
+  // Add the tid of task into this HashSet when the task is killed by other stage retries.
+  // For example, if stage failed and retry, when the task in the origin stage finish, it will
+  // kill the new stage task running the same partition data
+  private val killedByOtherStageRetries = new HashSet[Long]
 
 Review comment:
   I have handled it in `executorLost`, thank you!
   The rest of the code kills running tasks whose partition has already succeeded in another stage attempt, to reduce execution time and resource usage.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731955
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21431/
   Test PASSed.



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568625184
 
 
   > I think I can confirm this is a bug and it's caused by us adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same stage. Unfortunately the missing part is that we didn't try to kill the running task attempts when we mark the partitions as completed, so when a running task attempt fails with ExecutorLost it reverts the completed partition result (which is not necessary in this case).
   > 
   > To me the best solution here would be to kill all the running task attempts for the completed partition in the TSM inside method `markPartitionCompleted`; this would resolve the issue without any side effect.
   > 
   > Also cc @squito @cloud-fan @Ngone51
   
   Looking forward to your code review! @jiangxb1987 @squito @cloud-fan @Ngone51 
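
The proposal quoted above (kill the running attempts for a partition inside `markPartitionCompleted`) could be sketched roughly as follows. This is a toy model, not the patch itself: `ToyTaskSetManager`, `Attempt`, and `killRequests` are hypothetical, and only the names `markPartitionCompleted`, `successful`, and `killedByOtherStageRetries` come from this thread.

```scala
import scala.collection.mutable

// Toy model only: a hypothetical stand-in for the real TaskSetManager.
object KillOnCompleteSketch {
  final case class Attempt(tid: Long, partition: Int, var running: Boolean)

  final class ToyTaskSetManager(numPartitions: Int) {
    val successful: Array[Boolean] = Array.fill(numPartitions)(false)
    val attempts: mutable.ArrayBuffer[Attempt] = mutable.ArrayBuffer.empty
    // Task ids killed because another stage attempt finished their partition.
    val killedByOtherStageRetries: mutable.HashSet[Long] = mutable.HashSet.empty
    // Recorded kill requests (tid, reason); a real scheduler would ask its backend instead.
    val killRequests: mutable.ArrayBuffer[(Long, String)] = mutable.ArrayBuffer.empty

    def markPartitionCompleted(partition: Int): Unit = {
      if (!successful(partition)) {
        successful(partition) = true
        // Request a kill for every attempt still running on the completed partition.
        for (a <- attempts if a.partition == partition && a.running) {
          killedByOtherStageRetries += a.tid
          killRequests += ((a.tid, "another stage attempt succeeded"))
        }
      }
    }
  }
}
```

Recording the killed task ids lets a later `executorLost` or task-end handler tell these attempts apart from genuinely failed ones, which is the role the thread assigns to `killedByOtherStageRetries`.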



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568659508
 
 
   > seems speculative task has the same issue? We may have multiple running tasks for one partition.
   
   @cloud-fan The kill logic in `markPartitionCompleted` works like the speculative-task case: when the other running tasks for the partition finish, we ignore the result and mark the task as `Killed(another attempt succeeded)` in `handleSuccessfulTask`, or we don't reschedule it in `handleFailedTask`.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986415
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21458/
   Test PASSed.



[GitHub] [spark] seayoun commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361315646
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,7 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid)) {
 
 Review comment:
   ok, tks!
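
The condition in the hunk above can be read as a small predicate. The sketch below restates it over hypothetical types (`ToyTaskInfo` and `shouldResubmit` are illustrative names, not Spark APIs) so that each clause can be checked in isolation.

```scala
// Toy restatement of the executorLost condition; types here are hypothetical.
object ExecutorLostSketch {
  final case class ToyTaskInfo(tid: Long, index: Int, executorId: String, running: Boolean)

  // A task on the lost executor is resubmitted only when its partition is marked
  // successful, this attempt is no longer running (so it presumably produced the
  // output that was just lost), and it was not killed in favour of another attempt.
  def shouldResubmit(
      info: ToyTaskInfo,
      successful: IndexedSeq[Boolean],
      killedByOtherAttempt: Set[Long]): Boolean = {
    successful(info.index) && !info.running && !killedByOtherAttempt.contains(info.tid)
  }
}
```

With the added `!info.running` clause, a still-running attempt whose partition was marked successful by another stage attempt is left to the normal failed-task path rather than being treated as a lost completed task.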



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569570285
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569570289
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20719/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590666
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115928/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569571087
 
 
   **[Test build #115928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115928/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569832087
 
 
   Let me expand on case 2:
   If T1 finishes first, the partition in TSM2 (denoted P1) will be marked as successful too. Then the executor gets lost; since T2 is still running, we won't change `successful(P1)` to false.
   Then, possibly, another partition in TSM2 could be marked as successful by other tasks, and TSM2 thinks all its partitions have finished, but actually P1 has been lost and is not computed again.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731955
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21431/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569570289
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20719/
   Test PASSed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568896104
 
 
   @srowen @maropu @vanzin @HyukjinKwon @dongjoon-hyun 
   PTAL.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590664
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569574798
 
 
   @jiangxb1987 @Ngone51 
   For case 1, I think it is acceptable. If T1 and T2 have both finished, E1 and E2 are equally likely to get lost. If E2 gets lost, we can `Resubmit` in the current stage; if E1 gets lost, however, we only find out when the next stage hits a `FetchFailedException` and the partition has to be rescheduled, which is costly.
   For case 2, I have the same question as @Ngone51.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568914908
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-570142340
 
 
   Suppose T1 finished first, T2 is still running, and the executor is lost. There are 2 cases:
   1. The stage has already finished. Then we will hit a fetch failure later and retry the stage.
   2. The stage is still running. Then both TSM1 and TSM2 call `executorLost`. TSM1 will resubmit a task for P1. TSM2 may mark itself as finished. This is OK, as we still have a task submitted for P1.
   
   



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568625184
 
 
   > I think I can confirm this is a bug, caused by our adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same Stage. Unfortunately, the missing part is that we don't kill the running task attempts when we mark the partitions as completed, so when a running task attempt fails with ExecutorLost, it reverts the completed partition result (which is not necessary in this case).
   > 
   > To me the best solution here would be to kill all the running task attempts for the completed partition in the TSM inside method `markPartitionCompleted`; this would resolve the issue without any side effect.
   > 
   > Also cc @squito @cloud-fan @Ngone51
   
   Looking forward to your code review!



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568834508
 
 
   @cloud-fan PTAL. I thought about this more deeply and removed the kill logic. I think we needn't handle this case in `handleSuccessfulTask`; it can overwrite the shuffle meta info, because if the executor holding the partition's shuffle data is lost, we can `Resubmit` this partition in the current stage instead of rescheduling it in the next stage via `FetchFailedException`.
   cc @HyukjinKwon  @jiangxb1987 @Ngone51 



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903383
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20567/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986415
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21458/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568653582
 
 
   It seems speculative tasks have the same issue? We may have multiple running tasks for one partition.



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361572508
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,10 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        // We may have a running task whose partition has been marked as successful,
+        // because this partition has another task in another stage attempt.
+        // We will `handleFailedTask` at next if the partition's task hasn't finished.
 
 Review comment:
   `We treat it as a running task and will call handleFailedTask later`



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590664
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365741
 
 
   There are multiple corner cases not handled by the current solution.
   Imagine we have two TSMs (M1 and M2) working on the same Stage, with the corresponding tasks for a specific partition denoted T1 and T2:
   
   1. T1 and T2 might be scheduled on different executors (E1 and E2), and both tasks have finished. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is actually unnecessary because the shuffle files are on E1;
   2. T1 and T2 might be scheduled on the same executor; T1 has finished but T2 is still running. Then the executor gets lost. Since T2 is still running, the partition will not be marked as not successful. After a while another task may finish and mark the TSM as finished, but the shuffle files have actually been lost, so this leads to a new regression.
   I don't have a solution yet. I'm wondering whether we can put the successful task information into taskInfos inside markPartitionCompleted; if that is possible, the second problem above could probably be resolved.
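
To make the two cases concrete, here is a rough sketch that evaluates the `executorLost` condition for T2 in M2, once as it stood before this PR and once with the `!info.running` guard the PR adds. `AttemptView` and `ExecutorLostCheck` are hypothetical names invented for illustration; only the boolean conditions are taken from the diff under review.

```scala
// Hypothetical, simplified view of one task attempt as seen by a TSM when
// its executor is lost. Field names are illustrative, not Spark's API.
final case class AttemptView(
    partitionSuccessful: Boolean,   // stands for successful(index)
    running: Boolean,               // stands for info.running
    killedByOtherAttempt: Boolean)  // stands for killedByOtherAttempt.contains(tid)

object ExecutorLostCheck {
  // Condition on the removed ("-") line of the diff under review.
  def resubmitsBefore(a: AttemptView): Boolean =
    a.partitionSuccessful && !a.killedByOtherAttempt

  // Condition on the added ("+") line, with the extra running guard.
  def resubmitsAfter(a: AttemptView): Boolean =
    a.partitionSuccessful && !a.running && !a.killedByOtherAttempt

  // Case 1: T2 has already finished on E2, and E2 is lost.
  val case1T2: AttemptView =
    AttemptView(partitionSuccessful = true, running = false, killedByOtherAttempt = false)

  // Case 2: T1 finished, so the partition is marked successful in M2,
  // but T2 is still running on the executor that is lost.
  val case2T2: AttemptView =
    AttemptView(partitionSuccessful = true, running = true, killedByOtherAttempt = false)
}
```

In this model `resubmitsAfter(case1T2)` is true, so M2 adds a pending task even though the shuffle files may still be available on E1 (case 1). `resubmitsAfter(case2T2)` is false, so the partition stays marked successful even though the output on the lost executor is gone (case 2).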



[GitHub] [spark] itskals commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
itskals commented on a change in pull request #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#discussion_r360782455
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -100,6 +100,11 @@ private[spark] class TaskSetManager(
   // should not resubmit while executor lost.
   private val killedByOtherAttempt = new HashSet[Long]
 
+  // Add the tid of task into this HashSet when the task is killed by other stage retries.
+  // For example, if stage failed and retry, when the task in the origin stage finish, it will
+  // kill the new stage task running the same partition data
+  private val killedByOtherStageRetries = new HashSet[Long]
 
 Review comment:
   Also, could the part of the code that you marked as problematic in `executorLost` have been moved to `handleFailedTask`? I feel the code would have looked clearer there, and then the rest of the changes might not have been needed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574031548
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573985376
 
 
   retest this please



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568890537
 
 
   > LGTM. Can you update the PR description?
   
   @cloud-fan Ok, and thanks for your review!



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731943
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573787214
 
 
   **[Test build #116652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116652/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986413
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365741
 
 
   There are multiple corner cases not handled by the current solution.
   Imagine we have two TSMs (M1 and M2) working on the same Stage, with the corresponding tasks for a specific partition denoted T1 and T2:
   
   1. T1 and T2 might be scheduled on different executors (E1 and E2), and both tasks have finished. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is actually unnecessary because the shuffle files are on E1;
   2. T1 and T2 might be scheduled on the same executor; T1 has finished but T2 is still running. Then the executor gets lost. Since T2 is still running, the partition will not be marked as not successful. After a while another task may finish and mark the TSM as finished, but the shuffle files have actually been lost, so this leads to a new regression.
   
   I don't have a solution yet. I'm wondering whether we can put the successful task information into taskInfos inside markPartitionCompleted; if that is possible, the second problem above could probably be resolved.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903383
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20567/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568168893
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361314146
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,7 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid)) {
 
 Review comment:
   let's add some code comments here to explain what's going on. e.g. we may have a running task whose partition has been marked as successful, because this partition has another task in another stage attempt.
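
For readers following along, a toy model of the bookkeeping may help show why the extra `!info.running` check matters. This is only a sketch under stated assumptions: `ToyTaskSet` and `ToyTaskSetDemo` are made-up names, the second loop merely stands in for what `handleFailedTask` does to `copiesRunning`, and `canDequeue` assumes the dequeue condition described in the PR description (`copiesRunning` equal to 0). It is not the real `TaskSetManager`.

```scala
import scala.collection.mutable

// Toy bookkeeping for one task set. Names mirror the thread where possible,
// but this is an illustrative model, not Spark's TaskSetManager.
final class ToyTaskSet(numPartitions: Int) {
  private final class Attempt(val index: Int, val executorId: String, var running: Boolean)

  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  val copiesRunning: Array[Int] = Array.fill(numPartitions)(0)
  private val taskInfos = mutable.LinkedHashMap.empty[Long, Attempt]

  def launch(tid: Long, index: Int, executorId: String): Unit = {
    taskInfos(tid) = new Attempt(index, executorId, running = true)
    copiesRunning(index) += 1
  }

  // Another stage attempt finished this partition while our attempt still runs.
  def markPartitionCompleted(index: Int): Unit = {
    successful(index) = true
  }

  def executorLost(execId: String, guardRunning: Boolean): Unit = {
    // Pass 1: un-mark successful partitions whose attempt ran on the lost executor.
    for (info <- taskInfos.values if info.executorId == execId) {
      val skip = guardRunning && info.running
      if (successful(info.index) && !skip) {
        successful(info.index) = false
        copiesRunning(info.index) -= 1
      }
    }
    // Pass 2: fail attempts still running there (stand-in for handleFailedTask).
    for (info <- taskInfos.values if info.running && info.executorId == execId) {
      info.running = false
      copiesRunning(info.index) -= 1
    }
  }

  // Assumed dequeue condition, following the PR description.
  def canDequeue(index: Int): Boolean =
    copiesRunning(index) == 0 && !successful(index)
}

object ToyTaskSetDemo {
  // One partition, one running attempt, completed elsewhere, then executor lost.
  def run(guardRunning: Boolean): ToyTaskSet = {
    val tsm = new ToyTaskSet(numPartitions = 1)
    tsm.launch(tid = 0L, index = 0, executorId = "exec-1")
    tsm.markPartitionCompleted(0)
    tsm.executorLost("exec-1", guardRunning)
    tsm
  }
}
```

Without the guard (`ToyTaskSetDemo.run(guardRunning = false)`), both passes decrement the counter, so `copiesRunning(0)` ends at -1 with `successful(0)` false, and `canDequeue(0)` is false: the partition needs to run again but can never be picked. With the guard, only the second pass decrements, the counter returns to 0, and `successful(0)` stays true.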



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888780
 
 
   **[Test build #115765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115765/testReport)** for PR 26975 at commit [`5e73417`](https://github.com/apache/spark/commit/5e734177fbf21f6dc0fcd5cb5e7124a3504ea5d7).



[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986167
 
 
   **[Test build #116679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116679/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] itskals edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
itskals edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568664326
 
 
   I was of the opinion that when a task has been started by a stage attempt and is still in progress, no retries from other stage attempts should be made until its fate is known.
   To know whether the partition is already assigned to some task, the MapStatus entry for the partition could denote the intermediate step. (As of now MapStatusEntry is either null or filled, effectively a boolean. I think we can add a third state.)
   
   With this proposed model, we also save compute resources (no need to start a redundant computation if one stage attempt is already working on it). However, we still allow speculation, as it's within the same stage attempt.
   
   Do let me know if there are any shortcomings in this thought process.
   
   @cloud-fan @seayoun @jiangxb1987
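
A minimal sketch of the three-state idea proposed above. All names here (`PartitionState`, `Unassigned`, `InProgress`, `Completed`, `mayLaunch`) are hypothetical and are not part of Spark's `MapStatus`; the rule encoded is only the one stated in the comment, namely that another stage attempt should not start a task for a partition that is already in progress, while the owning attempt may still speculate.

```scala
// Hypothetical three-state status for a map partition; not Spark's MapStatus.
sealed trait PartitionState
case object Unassigned extends PartitionState
final case class InProgress(stageAttemptId: Int) extends PartitionState
case object Completed extends PartitionState

object PartitionStateDemo {
  // Decide whether the given stage attempt may launch a task for a partition.
  def mayLaunch(state: PartitionState, stageAttemptId: Int): Boolean = state match {
    case Unassigned          => true                      // nobody is working on it yet
    case InProgress(ownerId) => ownerId == stageAttemptId // only the owning attempt may speculate
    case Completed           => false                     // output is already available
  }
}
```

Under these assumptions a retry attempt would skip partitions that an earlier attempt is still computing, which is the resource saving the comment describes.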



[GitHub] [spark] itskals commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
itskals commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568664326
 
 
   I was of the opinion that when a task has been started by a previous stage and is still in progress, no retries from other stage attempts should be made until its fate is known.
   To know whether the partition is already assigned to some task, the MapStatus entry for the partition could denote the intermediate step. (As of now MapStatusEntry is either null or filled, effectively a boolean. I think we can add a third state.)
   I am not sure if there is any issue in making such an assumption. I can try to elaborate on the design if needed.
   With this proposed model, we also save compute resources (no need to start a redundant computation if one stage attempt is already working on it). However, we still allow speculation, as it's within the same stage attempt.
   



[GitHub] [spark] itskals edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
itskals edited a comment on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568664326
 
 
   I am of the opinion that when a task has been started by a stage attempt and is still in progress, no retries from other stage attempts should be made until its fate is known.
   To know whether the partition is already assigned to some task, the MapStatus entry for the partition could denote the intermediate step. (As of now MapStatusEntry is either null or filled, effectively a boolean. I think we can add a third state.)
   
   With this proposed model, we also save compute resources (no need to start a redundant computation if one stage attempt is already working on it). However, we still allow speculation, as it's within the same stage attempt.
   
   Do let me know if there are any shortcomings in this thought process.
   
   @cloud-fan @seayoun @jiangxb1987



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590666
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115928/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888780
 
 
   **[Test build #115765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115765/testReport)** for PR 26975 at commit [`5e73417`](https://github.com/apache/spark/commit/5e734177fbf21f6dc0fcd5cb5e7124a3504ea5d7).



[GitHub] [spark] seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569853896
 
 
   > Let me expand on case 2:
   > If T1 finishes first, the partition in TSM2 (denoted P1) will be marked as successful too. Then the executor gets lost; since T2 is still running, we won't change `successful(P1)` to false.
   > Then other partitions in TSM2 could be marked as successful by other tasks, and TSM2 would think all the partitions have finished, but actually P1 has been lost and is not computed again.
   
   
   There are two further cases in this situation:
   
   1. T1 and T2 run on different executors: it doesn't matter.
   2. T1 and T2 run on the same executor: T2 will not be retried since T1 has succeeded.
   Consider this situation:
    A stage has finished and then an executor holding the stage's shuffle files is lost. We can't reschedule the task since the stage has finished; the retry is triggered when the next stage gets a `FetchFailedException`.
   **This case is like the one we discussed: we won't reschedule the task in the finished TSM, so I think it is similar.**
   
   So I think this is reasonable. What do you think?



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569590405
 
 
   **[Test build #115928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115928/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986167
 
 
   **[Test build #116679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116679/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569158633
 
 
   Is it possible to add a new test case in TaskSetManagerSuite?



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug, and it's caused by our adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same Stage. Unfortunately, the missing part is that we didn't try to kill the running task attempts when we mark the partitions as completed, so when the running task attempts fail with ExecutorLost, the completed partition result is reverted (which is not necessary).
   
   To me the best solution here would be to kill all the running task attempts in the TSM inside method `markPartitionCompleted`; this would resolve the issue without any side effects.
   
   Also cc @squito @cloud-fan @Ngone51 
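
The proposal above could be sketched roughly as follows. This is a toy model under stated assumptions, not Spark code: `ToyTsm`, `runningAttempts` and the idea of returning the attempt ids that should receive a kill request are all invented for illustration.

```scala
import scala.collection.mutable

// Toy model; names are illustrative and not Spark's API.
final class ToyTsm(numPartitions: Int) {
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  // Partition index -> ids of attempts currently running in this task set.
  val runningAttempts = mutable.Map.empty[Int, mutable.Set[Long]]

  def launch(tid: Long, index: Int): Unit = {
    runningAttempts.getOrElseUpdate(index, mutable.Set.empty[Long]) += tid
  }

  // Mark the partition done and return the attempts that should be sent a kill request.
  def markPartitionCompleted(index: Int): Seq[Long] = {
    successful(index) = true
    runningAttempts.remove(index).map(_.toSeq.sorted).getOrElse(Seq.empty)
  }
}
```

The sketch only decides which attempts to ask to stop; whether a kill request actually takes effect before an executor-lost event arrives is a separate question that the thread goes on to discuss.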



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568652653
 
 
   OK to test



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573986413
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568658757
 
 
   > I'm not sure killing tasks can work. There is no guarantee that a task can always be killed successfully. And even if we can, we may send out the kill request, and immediately get the executor lost event before the task is killed.
   > 
   > I think we should accept the fact that a running task may be useless as its corresponding partition is completed, and deal with it well. e.g. when seeing executor lost, don't reschedule tasks whose corresponding partitions are already completed.
   
   I think it doesn't matter: if the driver gets the executor lost event before the task is killed, the TSM will run `handleFailedTask` and will not reschedule it.
   Btw, if the task finishes before it is killed, the app processes its success or failure status in `handleSuccessfulTask` or `handleFailedTask`: in `handleSuccessfulTask` we mark it as `Killed(another stage succeeded)`, and in `handleFailedTask` we will not reschedule the task.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568472226
 
 
   @Ngone51 Please look at this patch, thanks!



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569571087
 
 
   **[Test build #115928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115928/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903994
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115765/
   Test PASSed.



[GitHub] [spark] seayoun removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
seayoun removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-571408541
 
 
   @jiangxb1987 @Ngone51 @cloud-fan



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574031548
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903791
 
 
   **[Test build #115765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115765/testReport)** for PR 26975 at commit [`5e73417`](https://github.com/apache/spark/commit/5e734177fbf21f6dc0fcd5cb5e7124a3504ea5d7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568653284
 
 
   I'm not sure killing tasks can work. There is no guarantee that a task can always be killed successfully. And even if we can, we may send out the kill request, and immediately get the executor lost event before the task is killed.
   
   I think we should accept the fact that a running task may be useless as its corresponding partition is completed, and deal with it well. e.g. when seeing executor lost, don't reschedule tasks whose corresponding partitions are already completed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573787623
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116652/
   Test FAILed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568659508
 
 
   > seems speculative task has the same issue? We may have multiple running tasks for one partition.
   
   The kill logic in `markPartitionCompleted` is like that for speculative tasks: when the other running tasks finish, we ignore the result and mark the task `Killed(another attempt succeeded)` in `handleSuccessfulTask`, or we won't reschedule it in `handleFailedTask`.
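
A minimal sketch of the "first successful attempt wins" handling described above. It is a toy model, not Spark's `TaskSetManager`: `ToyResults` is hypothetical, and its two methods merely borrow the names `handleSuccessfulTask` and `handleFailedTask` to mirror the comment.

```scala
import scala.collection.mutable

// Toy model of "first successful attempt wins"; not Spark's TaskSetManager.
final class ToyResults(numPartitions: Int) {
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  val killedByOtherAttempt = mutable.HashSet.empty[Long]

  // Returns true if this attempt's result is accepted.
  def handleSuccessfulTask(tid: Long, index: Int): Boolean = {
    if (successful(index)) {
      killedByOtherAttempt += tid // late result: record it as superseded and ignore it
      false
    } else {
      successful(index) = true
      true
    }
  }

  // Returns true if the failed attempt's partition needs to be scheduled again.
  def handleFailedTask(tid: Long, index: Int): Boolean = !successful(index)
}
```

In this toy, a second success for the same partition is recorded in `killedByOtherAttempt` and dropped, and a failure for an already completed partition does not trigger rescheduling.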



[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-573731208
 
 
   **[Test build #116652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116652/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574031555
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116679/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#discussion_r361314146
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -939,7 +939,7 @@ private[spark] class TaskSetManager(
         && !isZombie) {
       for ((tid, info) <- taskInfos if info.executorId == execId) {
         val index = taskInfos(tid).index
-        if (successful(index) && !killedByOtherAttempt.contains(tid)) {
+        if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid)) {
 
 Review comment:
   Let's add some code comments here to explain what's going on, e.g. that we may have a running task whose partition has already been marked as successful.
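
The effect of the added `!info.running` check can be sketched in isolation. This is a simplified model rather than the actual `TaskSetManager`: `AttemptInfo`, `ExecutorLostSketch` and `tidsToResubmit` are hypothetical names introduced only for illustration, and only the condition itself is taken from the diff above.

```scala
// Hypothetical stand-in for Spark's TaskInfo; only the fields the check needs.
final case class AttemptInfo(index: Int, executorId: String, running: Boolean)

object ExecutorLostSketch {
  // Returns the task IDs on the lost executor that satisfy the condition from
  // the diff: partition marked successful, attempt not running, and attempt
  // not killed because another attempt succeeded.
  def tidsToResubmit(
      execId: String,
      taskInfos: Map[Long, AttemptInfo],
      successful: Array[Boolean],
      killedByOtherAttempt: Set[Long]): Seq[Long] = {
    taskInfos.toSeq.collect {
      case (tid, info)
          if info.executorId == execId &&
            successful(info.index) &&
            !info.running &&
            !killedByOtherAttempt.contains(tid) => tid
    }.sorted
  }
}
```

In this model, an attempt that is still running on the lost executor is skipped even when its partition is already marked as successful, which is the situation the review comment asks to document.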



[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 edited a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365741
 
 
   There are multiple corner cases not handled by the current solution.
   Imagine we have two TSMs (M1 and M2) working on the same stage, and denote their tasks for a specific partition as T1 and T2:
   
   1. T1 and T2 might be scheduled on different executors (E1 and E2), and both tasks have finished. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is unnecessary because the shuffle files are on E1.
   2. T1 and T2 might be scheduled on the same executor; T1 has finished but T2 is still running. Then the executor gets lost. Since T2 is still running, the partition will not be marked as not successful. Later, another task may finish and mark the TSM as finished even though the shuffle files are lost, which leads to a new regression.
   
   I don't have a solution yet. I'm wondering whether we can put the successful task information into `taskInfos` inside `markPartitionCompleted`; if that is possible, the second problem above could probably be resolved.
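
The two cases can be sketched by tracking which executors hold a successful output for each partition. This is a hypothetical model, not code from the PR or from Spark: `CornerCaseSketch`, `OutputLocations` and `partitionsToRecompute` are names made up for illustration, and the model assumes the scheduler knows every executor that holds a finished output.

```scala
object CornerCaseSketch {
  // Hypothetical record: executors known to hold a successful output per partition.
  type OutputLocations = Map[Int, Set[String]]

  // A partition needs to be recomputed only if no surviving executor holds its output.
  def partitionsToRecompute(locations: OutputLocations, lostExecutor: String): Set[Int] =
    locations.collect {
      case (partition, execs) if (execs - lostExecutor).isEmpty => partition
    }.toSet
}
```

Under these assumptions, case 1 (outputs on E1 and E2, E2 lost) yields no recomputation, while case 2 (the only output on the lost executor) yields the partition again.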



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903992
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903381
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#discussion_r361073047
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -698,9 +704,13 @@ private[spark] class TaskSetManager(
         totalResultSize -= resultSizeAcc.get.asInstanceOf[LongAccumulator].value
       }
 
+      val killedReason = if (killedByOtherAttempt.contains(tid)) {
+        TaskKilled("Finish but did not commit due to another attempt succeeded")
+      } else {
+        TaskKilled("Finish but did not commit due to task in another stage retry succeeded")
 
 Review comment:
   Is it really worth adding `killedByOtherStageRetries` just for a slightly better error message?



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568167085
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568168893
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903270
 
 
   **[Test build #115773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115773/testReport)** for PR 26975 at commit [`623639a`](https://github.com/apache/spark/commit/623639aab2eb5b048b4515de837f9e142e929f55).



[GitHub] [spark] cloud-fan closed pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975
 
 
   



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903270
 
 
   **[Test build #115773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115773/testReport)** for PR 26975 at commit [`623639a`](https://github.com/apache/spark/commit/623639aab2eb5b048b4515de837f9e142e929f55).



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888592
 
 
   LGTM. Can you update the PR description?



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568914756
 
 
   **[Test build #115773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115773/testReport)** for PR 26975 at commit [`623639a`](https://github.com/apache/spark/commit/623639aab2eb5b048b4515de837f9e142e929f55).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568682889
 
 
   I don't think it's safe to skip rescheduling. According to the comment in `executorLost`, we reschedule because all shuffle files on this executor are lost. The special case is `killedByOtherAttempt.contains(tid)`, which means a speculative task has finished on **another executor**. For stage attempts, there is no guarantee that two tasks for the same partition run on different executors.



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365197
 
 
   There are multiple corner cases not handled by the current solution.
   Imagine we have two TSMs (M1 and M2) working on the same stage, and denote their tasks for a specific partition as T1 and T2:
   1. T1 and T2 might be scheduled on different executors (E1 and E2); T1 has finished but T2 is still running. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is unnecessary because the shuffle files are on E1.
   2. T1 and T2 might be scheduled on the same executor; T1 has finished but T2 is still running. Then the executor gets lost. Since T2 is still running, the partition will not be marked as not successful. Later, another task may finish and mark the TSM as finished even though the shuffle files are lost, which leads to a new regression.
   
   I don't have a solution yet. I'm wondering whether we can put the successful task information into `taskInfos` inside `markPartitionCompleted`; if that is possible, the second problem above could probably be resolved.



[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-574030980
 
 
   **[Test build #116679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116679/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568888938
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20558/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-568903994
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115765/
   Test PASSed.



[GitHub] [spark] seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568715857
 
 
   @cloud-fan I think you are right! I will handle it.



[GitHub] [spark] jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568661701
 
 
   @cloud-fan I mean adding the running task attempts to `killedByOtherAttempt`, so that when they fail, the completed partitions won't be affected.
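
A toy model of this suggestion, assuming a much simplified task set: `ToyTaskSet`, `launch` and the `attempts` map are hypothetical, the model treats every launched attempt as still running, and only the names `successful`, `killedByOtherAttempt`, `markPartitionCompleted` and `executorLost` come from the thread.

```scala
import scala.collection.mutable

// Toy model; field names mirror the thread, behaviour is simplified.
final class ToyTaskSet(numPartitions: Int) {
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  val killedByOtherAttempt: mutable.Set[Long] = mutable.Set.empty[Long]
  // tid -> (partition index, executor id) for attempts launched by this task set
  private val attempts = mutable.Map.empty[Long, (Int, String)]

  def launch(tid: Long, partition: Int, execId: String): Unit =
    attempts(tid) = (partition, execId)

  // Called when another stage attempt finished `partition`: mark it done and
  // record the local attempts for that partition in killedByOtherAttempt.
  def markPartitionCompleted(partition: Int): Unit = {
    successful(partition) = true
    attempts.foreach { case (tid, (p, _)) =>
      if (p == partition) killedByOtherAttempt += tid
    }
  }

  // Returns the partitions that are un-marked and would have to run again.
  def executorLost(execId: String): Seq[Int] = {
    val lost = attempts.toSeq.collect {
      case (tid, (p, e))
          if e == execId && successful(p) && !killedByOtherAttempt.contains(tid) => p
    }.distinct.sorted
    lost.foreach(p => successful(p) = false)
    lost
  }
}
```

With this bookkeeping, losing the executor of such an attempt leaves the completed partition untouched. As noted elsewhere in the thread, that is only safe when the finished attempt ran on another executor.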
