You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/30 04:20:12 UTC

[GitHub] [spark] mridulm commented on a diff in pull request #38702: [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

mridulm commented on code in PR #38702:
URL: https://github.com/apache/spark/pull/38702#discussion_r1035507465


##########
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala:
##########
@@ -645,8 +645,15 @@ private[spark] class AppStatusListener(
   }
 
   override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
-    // TODO: can this really happen?
-    if (event.taskInfo == null) {
+    // TODO: can taskInfo null really happen?

Review Comment:
   To answer this question, no `taskInfo` cannot be `null`  IMO - can you also take a look at it @Ngone51 ?
   We can remove this check and the TODO once we are sure.



##########
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala:
##########
@@ -645,8 +645,11 @@ private[spark] class AppStatusListener(
   }
 
   override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
-    // TODO: can this really happen?
-    if (event.taskInfo == null) {
+    // TODO: can taskInfo null really happen?
+    // For resubmitted tasks caused by ExecutorLost, the SparkListenerTaskEnd is useless and
+    // will make activeTask in stage to be negative, this will cause stage not be removed in
+    // liveStages, and finally cause executor not removed in deadExecutors
+    if (event.taskInfo == null ||  event.reason == Resubmitted) {

Review Comment:
   Couple of things to clarify (the pr description/comments confused me a bit).
   
   a) There was a pair of task start and task end event which were fired for the task (let us call it Tr)
   b) When executor which ran Tr was lost, while stage is still running, a `Resubmitted` is fired for Tr.
   c) Subsequently, a new task start and task end will be fired for the retry of Tr.
   
   
   The issue here is, (b) is associated with (a) - which was a success earlier, but has now been marked as a failure.
   
   
   We should not be ignoring this event, but rather deal with it properly.
   For example, update counters, number of failed tasks, etc.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org