Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:24:12 UTC

[jira] [Updated] (SPARK-10551) Successful task-end event after task failed due to executor loss

     [ https://issues.apache.org/jira/browse/SPARK-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-10551:
---------------------------------
    Labels: bulk-closed  (was: )

> Successful task-end event after task failed due to executor loss
> ----------------------------------------------------------------
>
>                 Key: SPARK-10551
>                 URL: https://issues.apache.org/jira/browse/SPARK-10551
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Ryan Williams
>            Priority: Major
>              Labels: bulk-closed
>
> I'm doing forensics on some failed Spark applications and am seeing nonsensical things in the event logs, e.g.:
> {code}
> $ grep -n '"Task ID":12083' application_1439224376754_5702
> 24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
> 28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}}
> 29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}}
> {code}
> Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task failed due to {{ExecutorLostFailure}}, and then a TaskEnd event saying that the task succeeded.
> The history server is not showing me this file in the "complete" or "incomplete" sections, though it has this line in its stdout (and no apparent exceptions later), which I thought meant that it parsed the file correctly:
> {code}
> 15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702
> {code}
> [~arahuja] ran this application originally and says that the live web UI was showing inconsistent/nonsensical data.
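
The pattern described above (a TaskEnd with a failure reason and a TaskEnd with "Success" for the same Task ID) can be searched for across a whole event log rather than one grep at a time. The following is only a minimal sketch: it assumes the log holds one JSON object per line with the field names visible in the excerpt ("Event", "Task Info", "Task ID", "Task End Reason", "Reason"), and the function name find_contradictory_task_ends is hypothetical, not part of Spark.

```python
import json
import sys
from collections import defaultdict


def find_contradictory_task_ends(lines):
    """Return {task_id: [reasons in log order]} for every task that has
    both a "Success" TaskEnd and at least one non-"Success" TaskEnd.

    Assumes one JSON event per line, using the field names seen in the
    excerpt from the issue; this is not a Spark API.
    """
    reasons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip anything that is not valid JSON
        if not isinstance(event, dict):
            continue
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        task_id = event["Task Info"]["Task ID"]
        reasons[task_id].append(event["Task End Reason"]["Reason"])

    # Keep only tasks whose TaskEnd events disagree about the outcome.
    return {
        task_id: seen
        for task_id, seen in reasons.items()
        if "Success" in seen and any(r != "Success" for r in seen)
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to event log file>
    with open(sys.argv[1]) as log:
        for task_id, seen in sorted(find_contradictory_task_ends(log).items()):
            print(task_id, seen)
```

Run against the log quoted above, a check like this would be expected to report Task ID 12083 with the reasons ExecutorLostFailure and Success; it only flags the inconsistency and says nothing about which event is correct.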


