Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/04/13 18:43:12 UTC

[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492617#comment-14492617 ] 

Jason Lowe commented on TEZ-2311:
---------------------------------

The AM appeared to hang because it was still waiting for some vertices to complete.  Looking at the log messages from one of the vertices that did not complete:
{noformat}
2015-04-13 03:05:15,354 INFO [Dispatcher thread: Central] impl.VertexImpl: vertex_1428329756093_332520_1_11 [scope-1965] transitioned from NEW to RECOVERING due to event V_SOURCE_VERTEX_RECOVERED
2015-04-13 03:05:16,443 INFO [Dispatcher thread: Central] impl.VertexImpl: Received upstream event while still recovering, vertexId=vertex_1428329756093_332520_1_11 [scope-1965], vertexEventType=V_SOURCE_VERTEX_STARTED
2015-04-13 03:05:16,444 INFO [Dispatcher thread: Central] impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1428329756093_332520_1_11 [scope-1965], state=RUNNING, numInitedSourceVertices2, numStartedSourceVertices=1, numRecoveredSourceVertices=2, tasksIsNull=false, numTasks=769
2015-04-13 03:05:16,444 INFO [Dispatcher thread: Central] impl.VertexImpl: vertex_1428329756093_332520_1_11 [scope-1965] transitioned from RECOVERING to RUNNING due to event V_SOURCE_VERTEX_RECOVERED
{noformat}

I noticed the "Received upstream event while still recovering" message was logged only by the vertices that failed to process the kill event and never completed.  None of the vertices that completed logged this message.  It appears we buffer kill events that arrive during recovery but fail to play them back properly.
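
To make the suspected failure mode concrete, here is a minimal Java sketch of the buffer-and-replay pattern described above. It is not taken from the Tez code base: the class RecoveringVertexSketch, its State and EventType enums, and the recoveryComplete method are all hypothetical names chosen for illustration, and the real VertexImpl state machine is considerably more involved. The sketch only shows the intended behavior, in which every event that arrives while recovering is queued and then replayed in order once recovery finishes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: these names are illustrative and are NOT Tez's VertexImpl API.
class RecoveringVertexSketch {
    enum State { RECOVERING, RUNNING, KILLED }
    enum EventType { SOURCE_VERTEX_STARTED, KILL }

    private State state = State.RECOVERING;
    // Events that arrive during recovery are held here in arrival order.
    private final Queue<EventType> pendingEvents = new ArrayDeque<>();
    // Record of events that were actually processed, for inspection.
    private final List<EventType> handled = new ArrayList<>();

    State getState() { return state; }
    List<EventType> getHandled() { return handled; }

    void handle(EventType event) {
        if (state == State.RECOVERING) {
            // Too early to act on the event: buffer it instead of dropping it.
            pendingEvents.add(event);
            return;
        }
        handled.add(event);
        if (event == EventType.KILL) {
            state = State.KILLED;
        }
    }

    // Called once recovery finishes: restore the recovered state, then replay
    // every buffered event. Skipping the replay loop would leave a buffered
    // KILL unprocessed, which is the kind of hang suspected in this issue.
    void recoveryComplete(State recoveredState) {
        if (recoveredState == State.RECOVERING) {
            throw new IllegalArgumentException("recovered state must not be RECOVERING");
        }
        state = recoveredState;
        EventType event;
        while ((event = pendingEvents.poll()) != null) {
            handle(event);
        }
    }

    public static void main(String[] args) {
        RecoveringVertexSketch vertex = new RecoveringVertexSketch();
        vertex.handle(EventType.SOURCE_VERTEX_STARTED); // buffered
        vertex.handle(EventType.KILL);                  // buffered
        vertex.recoveryComplete(State.RUNNING);         // replays both events
        System.out.println(vertex.getState());          // prints KILLED
    }
}
```

In this sketch the kill takes effect only because recoveryComplete drains the queue. If the transition out of RECOVERING restored the state without replaying the queue, the vertex would stay RUNNING indefinitely, which matches the symptom seen in the log above.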

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>
> We saw an instance of a Tez job hanging despite receiving multiple kill requests from clients.  The AM was recovering from a prior attempt when the first kill request arrived.


