Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/07/24 02:37:04 UTC

[jira] [Comment Edited] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639538#comment-14639538 ] 

Jeff Zhang edited comment on TEZ-2311 at 7/24/15 12:36 AM:
-----------------------------------------------------------

I thought about adding a recovery log entry for the DAG kill operation, but it seemed a little heavy here. Thinking about it again, it does not seem difficult (I will post another patch). The change in VertexImpl doesn't resolve the hang issue completely. Consider the case where some vertices are recovered to KILLED, while others are recovered to RUNNING and a new task attempt is scheduled. That new task attempt may wait indefinitely for data movement events from its upstream vertex. Or, if no task attempt is scheduled, the vertex's VertexManager may wait for something from upstream.
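
To make the case above concrete, here is a minimal toy model of it. This is only a sketch: RecoveryHangSketch, State, and stuckVertices are hypothetical names made up for illustration and are not Tez classes or APIs. It assumes a vertex can hang whenever it is recovered to RUNNING while at least one of its upstream vertices was recovered to KILLED, since a KILLED upstream never produces the events the downstream is waiting for.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical and are not Tez classes.
class RecoveryHangSketch {
    enum State { KILLED, RUNNING }

    // Returns the RUNNING vertices that have at least one KILLED upstream vertex.
    // Such a vertex may wait forever, because a KILLED upstream sends no events.
    static List<String> stuckVertices(Map<String, State> recovered,
                                      Map<String, List<String>> upstream) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, State> e : recovered.entrySet()) {
            if (e.getValue() != State.RUNNING) {
                continue;
            }
            for (String up : upstream.getOrDefault(e.getKey(), List.of())) {
                if (recovered.get(up) == State.KILLED) {
                    stuck.add(e.getKey());
                    break;
                }
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        // v1 was recovered to KILLED, v2 (downstream of v1) to RUNNING.
        Map<String, State> recovered = new LinkedHashMap<>();
        recovered.put("v1", State.KILLED);
        recovered.put("v2", State.RUNNING);

        Map<String, List<String>> upstream = new LinkedHashMap<>();
        upstream.put("v2", List.of("v1"));

        System.out.println(stuckVertices(recovered, upstream)); // prints [v2]
    }
}
```

In this model v2 is the vertex that would hang. A kill recorded at the DAG level would avoid having to detect such vertices one by one, because every vertex would be recovered to KILLED.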


was (Author: zjffdu):
I thought about adding a recovery log entry for the DAG kill operation, but it seemed a little heavy here. Thinking about it again, it does not seem difficult (I will post another patch). The change in VertexImpl doesn't resolve the hang issue completely. Consider the case where all the vertices are recovered to KILLED, and one vertex is recovered to RUNNING and a new task attempt is scheduled. That new task attempt may wait indefinitely for data movement events from its upstream vertex. Or, if no task attempt is scheduled, the vertex's VertexManager may wait for something from upstream.

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill requests from clients.  The AM was recovering from a prior attempt when the first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)