
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/23 23:42:23 UTC

[GitHub] [spark] jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever

URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug. It is caused by the `sched.markPartitionCompletedInAllTaskSets` logic we added: when a task attempt from one TSM (TaskSetManager) succeeds, the partition is marked as completed in all the TSMs targeting the same stage. Unfortunately, we didn't kill the running task attempts when marking the partitions as completed, so when those running task attempts later fail with ExecutorLost, the failure reverts the completed partition result (which is not necessary in this case).
   
   To me, the best solution here would be to kill all the running task attempts in the TSM inside the `markPartitionCompleted` method; this would resolve the issue without any side effects.
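
To make the sequence of events concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: `ToyTaskSetManager`, `ToyScheduler`, `launch`, `executorLost`, `partitionSurvives`, and the `killRunningOnComplete` flag are hypothetical names invented for illustration and do not mirror Spark's actual `TaskSetManager` code. Only the method name `markPartitionCompleted` is taken from the discussion.

```scala
import scala.collection.mutable

// Toy stand-in for a TSM. Hypothetical class, not Spark's real TaskSetManager.
final class ToyTaskSetManager(killRunningOnComplete: Boolean) {
  // Partitions this TSM currently considers finished.
  val completed: mutable.Set[Int] = mutable.Set.empty[Int]
  // Partition -> executor on which an attempt is still running.
  val running: mutable.Map[Int, String] = mutable.Map.empty[Int, String]

  def launch(partition: Int, executor: String): Unit =
    running(partition) = executor

  // Called when another TSM for the same stage finished this partition.
  def markPartitionCompleted(partition: Int): Unit = {
    completed += partition
    // Proposed fix (modelled here as simply dropping the attempt):
    // kill the now-redundant running attempt.
    if (killRunningOnComplete) running -= partition
  }

  // Executor crash: every attempt still running on it fails, and in this
  // model the failure reverts the partition's completed state.
  def executorLost(executor: String): Unit = {
    val lost = running.collect { case (p, e) if e == executor => p }.toList
    lost.foreach { p =>
      running -= p
      completed -= p
    }
  }
}

object ToyScheduler {
  // Returns true if partition 0 is still marked completed after the
  // executor running the stale attempt is lost.
  def partitionSurvives(killRunningOnComplete: Boolean): Boolean = {
    val tsm = new ToyTaskSetManager(killRunningOnComplete)
    tsm.launch(0, "exec-1")        // attempt for partition 0 is running
    tsm.markPartitionCompleted(0)  // partition 0 succeeded in another TSM
    tsm.executorLost("exec-1")     // executor running the stale attempt crashes
    tsm.completed.contains(0)
  }
}
```

In this model, `ToyScheduler.partitionSurvives(killRunningOnComplete = false)` reproduces the revert, while passing `true` leaves the partition completed because no attempt is left running when the executor is lost.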
   
   Also cc @squito @cloud-fan @Ngone51 
