Posted to issues@spark.apache.org by "Yavgeni Hotimsky (JIRA)" <ji...@apache.org> on 2018/07/18 12:35:00 UTC

[jira] [Updated] (SPARK-24848) When a stage fails onStageCompleted is called before onTaskEnd

     [ https://issues.apache.org/jira/browse/SPARK-24848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yavgeni Hotimsky updated SPARK-24848:
-------------------------------------
    Description: 
It seems that when a stage fails because one of its tasks has failed too many times, the SparkListener's onStageCompleted callback is invoked before the onTaskEnd callback for the failing task. We are using Structured Streaming in this case.

We noticed this because we built a listener that tracks the precise number of active tasks, exported as a metric. It used the stage callbacks to maintain a map from stage ids to metadata extracted from the jobGroupId. The onStageCompleted handler removed entries from this map to prevent unbounded memory usage, and in this case the onTaskEnd callback fired after onStageCompleted, so it could not find the stageId in the map. We worked around it by replacing the map with a timed cache.
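The timed-cache workaround described above can be sketched as follows. This is a minimal, Spark-free illustration: TimedStageCache and its method names are invented for this example and are not part of any Spark API, but the comments indicate which real SparkListener callback would drive each method. Instead of deleting an entry eagerly when a stage completes, the cache stamps it with an expiry time, so a late onTaskEnd can still resolve the stage id before a periodic reaper removes it.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical TTL cache from stage id to metadata (e.g. data derived
// from the jobGroupId). Entries are only scheduled for removal on stage
// completion, never removed immediately, so out-of-order task-end
// events can still look them up for a grace period.
class TimedStageCache[V](ttlMillis: Long,
                         clock: () => Long = () => System.currentTimeMillis()) {
  private case class Entry(value: V, expiresAt: Long)
  private val entries = new ConcurrentHashMap[Int, Entry]()

  // Would be called from onStageSubmitted: store metadata, no expiry yet.
  def put(stageId: Int, value: V): Unit =
    entries.put(stageId, Entry(value, Long.MaxValue))

  // Would be called from onStageCompleted: schedule removal instead of
  // removing, tolerating an onTaskEnd that arrives afterwards.
  def expireLater(stageId: Int): Unit = {
    val e = entries.get(stageId)
    if (e != null) entries.put(stageId, e.copy(expiresAt = clock() + ttlMillis))
  }

  // Would be called from onTaskEnd: resolves even for a completed stage.
  def get(stageId: Int): Option[V] = Option(entries.get(stageId)).map(_.value)

  // Periodic cleanup keeps memory bounded despite the grace period.
  def reap(): Unit = {
    val now = clock()
    entries.forEach((id, e) => if (e.expiresAt <= now) entries.remove(id))
  }
}
```

With a stage id of 7 and a 5-second TTL, a lookup made after expireLater(7) still succeeds until reap() runs past the expiry, which is exactly the behavior the plain map lacked.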

  was:
It seems that when a stage fails because one of its tasks has failed too many times, the SparkListener's onStageCompleted callback is invoked before the onTaskEnd callback for the failing task. We are using Structured Streaming in this case.

We noticed this because we built a listener that tracks the precise number of active tasks per process, exported as a metric. It used the stage callbacks to maintain a map from stage ids to metadata extracted from the jobGroupId. The onStageCompleted handler removed entries from this map to prevent unbounded memory usage, and in this case the onTaskEnd callback fired after onStageCompleted, so it could not find the stageId in the map. We worked around it by replacing the map with a timed cache.


> When a stage fails onStageCompleted is called before onTaskEnd
> --------------------------------------------------------------
>
>                 Key: SPARK-24848
>                 URL: https://issues.apache.org/jira/browse/SPARK-24848
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yavgeni Hotimsky
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org