Posted to issues@spark.apache.org by "Ryan Williams (JIRA)" <ji...@apache.org> on 2015/07/20 17:52:04 UTC

[jira] [Commented] (SPARK-9038) Missing TaskEnd event when task attempt is superseded by another (speculative) attempt

    [ https://issues.apache.org/jira/browse/SPARK-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633723#comment-14633723 ] 

Ryan Williams commented on SPARK-9038:
--------------------------------------

Note: this also causes Spark to think that the executors that ran the superseded attempts are still running tasks, so they can't be dynamically de-allocated.
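For illustration, here is a minimal sketch (my own names, not the actual ExecutorAllocationManager code) of the per-executor TaskStart/TaskEnd bookkeeping that dynamic allocation's idle-executor detection relies on; with no {{TaskEnd}} for 15820, the running-task count for its executor never returns to zero, so that executor never looks idle:

{code}
import scala.collection.mutable

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

// Minimal sketch of per-executor running-task bookkeeping, analogous to (but not
// copied from) what dynamic allocation does internally via a SparkListener.
class RunningTaskTracker extends SparkListener {
  private val runningTasksPerExecutor = mutable.Map.empty[String, Int].withDefaultValue(0)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    runningTasksPerExecutor(taskStart.taskInfo.executorId) += 1
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    runningTasksPerExecutor(taskEnd.taskInfo.executorId) -= 1
  }

  // An executor with a non-zero count is considered busy; if a TaskEnd is never
  // posted (as happened for 15820 here), its executor never looks idle again.
  def isIdle(executorId: String): Boolean = runningTasksPerExecutor(executorId) <= 0
}
{code}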

> Missing TaskEnd event when task attempt is superseded by another (speculative) attempt
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9038
>                 URL: https://issues.apache.org/jira/browse/SPARK-9038
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Ryan Williams
>
> Yesterday I ran a job that produced [this event log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0].
> There are 17314 {{TaskStart}} events and 17313 {{TaskEnd}} events; task ID 15820 (aka 13.0.526.0) is missing its {{TaskEnd}} event.
> A speculative second attempt, ID 16295 (13.0.526.1), finished before it; 15820 was the last task attempt running in stage-attempt 13.0 and job 3, and when it finished, both the stage attempt and the job were marked as succeeded.
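> For reference, those counts can be pulled straight out of the event log; here's a minimal sketch, assuming the default JSON event-log format (one event object per line, with an "Event" field) and with the file path as a placeholder:
> {code}
> import scala.io.Source
>
> // Count TaskStart/TaskEnd events in a Spark event log (one JSON object per line).
> // The path below is a placeholder for wherever the log was saved.
> val lines = Source.fromFile("application_1432740718700_3010").getLines().toSeq
> val starts = lines.count(_.contains("\"Event\":\"SparkListenerTaskStart\""))
> val ends   = lines.count(_.contains("\"Event\":\"SparkListenerTaskEnd\""))
> println(s"TaskStart: $starts, TaskEnd: $ends")  // here: 17314 vs 17313
> {code}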
> At the conclusion of stage 13 / job 3, I observed a few things to be in conflicting/inconsistent states:
> *Reflecting 15820 as having finished successfully:*
> * The "stage page" for 13.0 [showed "SUCCESS" in the "Status" column of the per-task-attempt table|http://cl.ly/image/2O0O42382p2W?_ga=1.265890767.118106744.1401937910].
> * The driver stdout reported 15820's successful finish, and that it was being ignored due to another attempt of the same task (16295, per above) having already succeeded:
> {code}
> 15/07/13 23:30:40 INFO scheduler.TaskSetManager: Ignoring task-finished event for 526.0 in stage 13.0 because task 526 has already completed successfully
> 15/07/13 23:30:40 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose tasks have all completed, from pool
> 15/07/13 23:30:40 INFO scheduler.DAGScheduler: Job 3 finished: collect at JointHistogram.scala:107, took 579.659523 s
> {code}
> *Not reflecting 15820 as having finished at all:*
> * As I mentioned before, [the event log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0] is missing a {{TaskEnd}} for 15820.
> * The {{AllJobsPage}} shows 11258 tasks finished in job 3; it would have been 11259 had 15820's completion been counted.
> ** Additionally, inspecting the page's DOM revealed a 1-task-wide sliver of light blue (i.e. running task(s)) in the progress bar.
> ** [This screenshot|http://cl.ly/image/3O201z0e0G2C?_ga=1.265890767.118106744.1401937910] shows both of these on the {{AllJobsPage}}.
> * A history server, pointed at the event log, consistently shows 15820 as not having finished.
> ** This is somewhat unsurprising given that the event log powering the history server doesn't contain a {{TaskEnd}} for 15820, but it seems notable nonetheless, since the live UI seemingly *did* partially record the task as having ended (cf. the stage page showing SUCCESS).
> ** Stage page shows 15820 as RUNNING.
> ** AllJobsPage shows 11258 tasks succeeded, 1 running.
> I've gone over the relevant task-success code paths and can't understand how the stage page could show SUCCESS in the live UI without anything having been written to the event log or the AllJobsPage's counters having been updated. [Here is a chunk of my driver stdout|https://www.dropbox.com/s/pr7rswt4o2umm20/3010.stdout?dl=0], which shows nothing abnormal as far as I can tell; [the dreaded message about events being dropped|https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40] never appeared while the app was running, which was one of my only guesses about how this could have happened (though it wouldn't fully explain all of the above anyway).
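> In case it helps with reproducing or diagnosing this, here is a minimal sketch (names are mine) of a listener one could register to flag the mismatch live, independently of both the event log and the UI:
> {code}
> import scala.collection.mutable
>
> import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd, SparkListenerTaskStart}
>
> // Diagnostic listener: counts outstanding task attempts per stage attempt and warns
> // when a stage completes while some of its attempts never received a TaskEnd.
> class TaskEndAuditListener extends SparkListener {
>   private val outstanding = mutable.Map.empty[(Int, Int), Int].withDefaultValue(0)
>
>   override def onTaskStart(t: SparkListenerTaskStart): Unit = synchronized {
>     outstanding((t.stageId, t.stageAttemptId)) += 1
>   }
>
>   override def onTaskEnd(t: SparkListenerTaskEnd): Unit = synchronized {
>     outstanding((t.stageId, t.stageAttemptId)) -= 1
>   }
>
>   override def onStageCompleted(s: SparkListenerStageCompleted): Unit = synchronized {
>     val stageId = s.stageInfo.stageId
>     val attemptId = s.stageInfo.attemptId
>     val missing = outstanding((stageId, attemptId))
>     if (missing != 0) {
>       // Would have fired for stage-attempt 13.0 here, with missing == 1 (task 15820).
>       println(s"Stage $stageId.$attemptId completed with $missing task attempt(s) missing a TaskEnd")
>     }
>   }
> }
> // register with: sc.addSparkListener(new TaskEndAuditListener)
> {code}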
> Interested in hearing anyone's thoughts about how I might have arrived at this inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
