Posted to yarn-issues@hadoop.apache.org by "Ravi Prakash (JIRA)" <ji...@apache.org> on 2012/10/17 22:30:03 UTC

[jira] [Created] (YARN-167) AM stuck in KILL_WAIT for days when node is lost in the middle

Ravi Prakash created YARN-167:
---------------------------------

             Summary: AM stuck in KILL_WAIT for days when node is lost in the middle
                 Key: YARN-167
                 URL: https://issues.apache.org/jira/browse/YARN-167
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 0.23.3
            Reporter: Ravi Prakash


We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483413#comment-13483413 ] 

Robert Joseph Evans commented on YARN-167:
------------------------------------------

Looking at the UI for one of the jobs that is stuck in this state and a heap dump for that AM, I can see that the JOB is in KILL_WAIT and so are many of its tasks.  But for all of the tasks in KILL_WAIT that I looked at, the task attempts are all in FAILED, and none of them failed because of a node that disappeared.  It looks very much like TaskImpl just needs to be able to handle T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED in the KILL_WAIT state, instead of ignoring them.  I will look to see if this also exists in 2.0.  I think all we need to do to reproduce this is to launch a large job that will have most of its tasks fail, and then try to kill it before the job fails on its own.

This particular job had 2645 map tasks: 634 of them got stuck in KILL_WAIT, 1347 of them were successfully killed, and 623 of them finished with a SUCCESS. This was running on a 2,000 node cluster.  The failed tasks appeared to take about 20 seconds before they failed, but the last attempts to fail all ended within a second of each other.
                
> AM stuck in KILL_WAIT for days
> ------------------------------
>
>                 Key: YARN-167
>                 URL: https://issues.apache.org/jira/browse/YARN-167
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 0.23.3
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: TaskAttemptStateGraph.jpg
>
>
> We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

[jira] [Updated] (YARN-167) AM stuck in KILL_WAIT for days when node is lost in the middle

Posted by "Ravi Prakash (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated YARN-167:
------------------------------

    Description: We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state  (was: We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list.)
    
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482778#comment-13482778 ] 

Vinod Kumar Vavilapalli commented on YARN-167:
----------------------------------------------

bq. Afterwards, the Task Attempt transitions from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED. In either of these states TA_KILL is ignored. So the Task stays in KILL_WAIT and consequently the Job too.
This is fine. Job waits for all tasks and taskAttempts to 'finish', not just killed. In this case, TA will succeed and inform the job about the same, so that the job doesn't wait for this task anymore.

bq. I am rather nervous about back porting MAPREDUCE-3353. It is a major feature that has a significant footprint and was not all that stable when it first went in. I know that it has since stabilized but I am still nervous about such a large change.
Understand that it is a big change, but if we want to address this issue, we need that patch. Given that MAPREDUCE-3353 has been hardened on trunk, we should consider pulling it into 0.23.

bq. It seems like it would be simpler to handle the KILL events in the states that missed it.
There isn't anything like a missed state that is causing this issue if I understand Ravi's issue description correctly.
                
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Ravi Prakash (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483339#comment-13483339 ] 

Ravi Prakash commented on YARN-167:
-----------------------------------

bq. This is fine. Job waits for all tasks and taskAttempts to 'finish', not just killed. In this case, TA will succeed and inform the job about the same, so that the job doesn't wait for this task anymore.

Vinod! I'm sorry I might not be understanding how this happens. In TaskImpl : 
{noformat}
    // Ignore-able transitions.
    .addTransition(
        TaskStateInternal.KILL_WAIT,
        TaskStateInternal.KILL_WAIT,
        EnumSet.of(TaskEventType.T_KILL,
            TaskEventType.T_ATTEMPT_LAUNCHED,
            TaskEventType.T_ATTEMPT_COMMIT_PENDING,
            TaskEventType.T_ATTEMPT_FAILED,
            TaskEventType.T_ATTEMPT_SUCCEEDED,
            TaskEventType.T_ADD_SPEC_ATTEMPT))
{noformat}
So when the TaskAttemptImpl does indeed send T_ATTEMPT_SUCCEEDED, it is ignored by the TaskImpl, and its state stays KILL_WAIT. Am I missing something? Can you please point me to the code path?
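
To make the question concrete, here is a minimal, self-contained sketch of what an ignore-only self-transition does to the task state. It is a toy model and not the Hadoop state machine: the class IgnoreSketch, its enums, and the handle method are hypothetical names chosen for illustration, and only the event names come from the snippet above.

```java
import java.util.EnumSet;

public class IgnoreSketch {
    enum TaskState { RUNNING, KILL_WAIT, KILLED }
    enum TaskEvent { T_KILL, T_ATTEMPT_KILLED, T_ATTEMPT_FAILED, T_ATTEMPT_SUCCEEDED }

    // Events the toy KILL_WAIT state swallows. Assumption: this mirrors the
    // idea of the "ignore-able transitions" above, not the real table.
    static final EnumSet<TaskEvent> IGNORED_IN_KILL_WAIT = EnumSet.of(
        TaskEvent.T_KILL,
        TaskEvent.T_ATTEMPT_FAILED,
        TaskEvent.T_ATTEMPT_SUCCEEDED);

    // Returns the next state for a (state, event) pair in the toy model.
    static TaskState handle(TaskState state, TaskEvent event) {
        if (state == TaskState.RUNNING && event == TaskEvent.T_KILL) {
            return TaskState.KILL_WAIT;
        }
        if (state == TaskState.KILL_WAIT) {
            if (IGNORED_IN_KILL_WAIT.contains(event)) {
                return TaskState.KILL_WAIT; // self-loop: nothing is recorded
            }
            if (event == TaskEvent.T_ATTEMPT_KILLED) {
                return TaskState.KILLED;    // the only exit in this model
            }
        }
        return state;
    }

    public static void main(String[] args) {
        TaskState s = handle(TaskState.RUNNING, TaskEvent.T_KILL);
        // The attempt finishes successfully before the kill reaches it.
        s = handle(s, TaskEvent.T_ATTEMPT_SUCCEEDED);
        System.out.println(s); // prints KILL_WAIT
    }
}
```

In this model, once the only live attempt has reported success, no further event can arrive to move the task out of KILL_WAIT, which is the situation the question above is asking about.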
                
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483417#comment-13483417 ] 

Robert Joseph Evans commented on YARN-167:
------------------------------------------

Yes, it looks very much like this can also happen in branch-2 and trunk.  I also wanted to mention that the stack traces showed more or less nothing.  All of the threads were waiting on I/O or event queues. Nothing was actually processing any data or deadlocked holding locks.
                
[jira] [Updated] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Ravi Prakash (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated YARN-167:
------------------------------

    Summary: AM stuck in KILL_WAIT for days  (was: AM stuck in KILL_WAIT for days when node is lost in the middle)
    
[jira] [Updated] (YARN-167) AM stuck in KILL_WAIT for days when node is lost in the middle

Posted by "Ravi Prakash (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated YARN-167:
------------------------------

    Attachment: TaskAttemptStateGraph.jpg
    
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483305#comment-13483305 ] 

Robert Joseph Evans commented on YARN-167:
------------------------------------------

I am still nervous about pulling in a big change like MAPREDUCE-3353 just to fix a Major bug.  I am not going to block this going in if you come up with a patch, but I really want to beat on the patch before we pull it into 0.23.  I just want to be sure that it fixes the issue, and does not destabilize anything. This is only a Major bug because the only time the job gets stuck is when a user sends it a kill command, so the user already wants the job to go away.  The job's tasks do go away, but the AM gets stuck and is taking up a small amount of resources on the queue, which is bad, but not the end of the world.

bq. {quote}There isn't anything like a missed state that is causing this issue if I understand Ravi's issue description correctly. {quote}
bq. Obviously, this could be wrong.

You are correct that the task attempt's state machine cannot really fix this unless it lies, which would be an ugly hack, but it seems that it is not the Task Attempt that is getting stuck.  I was thinking that KILL_WAIT is waiting for the wrong things.  In TaskImpl, KILL_WAIT ignores T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED, when it should actually be keeping track of all pending attempts and exiting KILL_WAIT once all pending attempts have exited, whether with a kill, a success, or a failure.  It is a bug for TaskImpl to assume that as soon as it sends a KILL to the task attempt, that KILL will beat out all other events and kill the attempt.  JobImpl's state machine appears to do something like this already.
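
A rough sketch of that bookkeeping, under stated assumptions: this is not TaskImpl and not a proposed patch. The class KillWaitSketch, its methods, and the AttemptOutcome enum are hypothetical names, and attempts are identified by plain strings purely for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class KillWaitSketch {
    enum State { RUNNING, KILL_WAIT, KILLED }
    enum AttemptOutcome { KILLED, FAILED, SUCCEEDED }

    private State state = State.RUNNING;
    // Attempts that have been launched and have not yet reported an outcome.
    private final Set<String> pendingAttempts = new HashSet<>();

    void attemptLaunched(String attemptId) {
        pendingAttempts.add(attemptId);
    }

    // A kill request: wait only if something is still outstanding.
    void kill() {
        if (state != State.RUNNING) {
            return;
        }
        state = pendingAttempts.isEmpty() ? State.KILLED : State.KILL_WAIT;
    }

    // Any terminal outcome (kill, failure, or success) clears the attempt.
    // The outcome is deliberately not inspected: the point is that all
    // three count as "this attempt has exited".
    void attemptFinished(String attemptId, AttemptOutcome outcome) {
        pendingAttempts.remove(attemptId);
        if (state == State.KILL_WAIT && pendingAttempts.isEmpty()) {
            state = State.KILLED;
        }
    }

    State getState() {
        return state;
    }

    public static void main(String[] args) {
        KillWaitSketch task = new KillWaitSketch();
        task.attemptLaunched("attempt_0");
        task.attemptLaunched("attempt_1");
        task.kill();
        // One attempt wins the race and succeeds, the other fails.
        task.attemptFinished("attempt_0", AttemptOutcome.SUCCEEDED);
        System.out.println(task.getState()); // prints KILL_WAIT
        task.attemptFinished("attempt_1", AttemptOutcome.FAILED);
        System.out.println(task.getState()); // prints KILLED
    }
}
```

The design choice here is that the task leaves KILL_WAIT based on how many attempts are still outstanding, so it no longer matters whether the KILL or the attempt's own terminal event was processed first.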

                
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482853#comment-13482853 ] 

Vinod Kumar Vavilapalli commented on YARN-167:
----------------------------------------------

bq. There isn't anything like a missed state that is causing this issue if I understand Ravi's issue description correctly.
Obviously, this could be wrong.

Ravi, if you have one of these stuck AMs lying around, can you take a thread dump please?
                
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482716#comment-13482716 ] 

Robert Joseph Evans commented on YARN-167:
------------------------------------------

I am rather nervous about back porting MAPREDUCE-3353.  It is a major feature that has a significant footprint and was not all that stable when it first went in.  I know that it has since stabilized but I am still nervous about such a large change. It seems like it would be simpler to handle the KILL events in the states that missed it.
                
[jira] [Assigned] (YARN-167) AM stuck in KILL_WAIT for days when node is lost in the middle

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli reassigned YARN-167:
--------------------------------------------

    Assignee: Vinod Kumar Vavilapalli

We should backport MAPREDUCE-3353 to 0.23. That automatically fixes this issue, because the AM then acts on lost nodes and kills the corresponding TaskAttempts, which in turn keeps the Job from getting stuck in the KILL_WAIT state.

Will do the backport.
                
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days when node is lost in the middle

Posted by "Ravi Prakash (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482682#comment-13482682 ] 

Ravi Prakash commented on YARN-167:
-----------------------------------

This is my best guess for what is happening. Imagine a job is running, and we send it the KILL signal. The job transitions from RUNNING to KILL_WAIT. The task transitions from RUNNING to KILL_WAIT. However, some task attempts may be in COMMIT_PENDING state. Pasting state graph here for reference 
!TaskAttemptStateGraph.jpg!

If the TA_DONE event is queued in the event queue before the TA_KILL event, then the task attempt is transitioned from COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP (which we would think should've transitioned to KILL_CONTAINER_CLEANUP, because hey, we sent it TA_KILL and it was in COMMIT_PENDING). Afterwards, the Task Attempt transitions from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED. In either of these states TA_KILL is ignored. So the Task stays in KILL_WAIT and consequently the Job too.
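
The ordering dependence in this guess can be shown with a small, self-contained sketch. It is a deliberately simplified model of the attempt states named above and not the real TaskAttemptImpl: the class AttemptRaceSketch, the step and drain methods, and the single cleanup event are assumptions made for illustration, and the real state graph has more states than are modelled here.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class AttemptRaceSketch {
    enum AttemptState {
        COMMIT_PENDING, SUCCESS_CONTAINER_CLEANUP, KILL_CONTAINER_CLEANUP, SUCCEEDED, KILLED
    }
    enum AttemptEvent { TA_DONE, TA_KILL, TA_CONTAINER_CLEANED }

    // One transition of the simplified attempt state machine.
    static AttemptState step(AttemptState s, AttemptEvent e) {
        switch (s) {
            case COMMIT_PENDING:
                if (e == AttemptEvent.TA_DONE) return AttemptState.SUCCESS_CONTAINER_CLEANUP;
                if (e == AttemptEvent.TA_KILL) return AttemptState.KILL_CONTAINER_CLEANUP;
                return s;
            case SUCCESS_CONTAINER_CLEANUP:
                // TA_KILL is ignored here, as described above.
                if (e == AttemptEvent.TA_CONTAINER_CLEANED) return AttemptState.SUCCEEDED;
                return s;
            case KILL_CONTAINER_CLEANUP:
                if (e == AttemptEvent.TA_CONTAINER_CLEANED) return AttemptState.KILLED;
                return s;
            default:
                return s; // terminal states ignore everything
        }
    }

    // Processes events strictly in queue (FIFO) order, starting from COMMIT_PENDING.
    static AttemptState drain(List<AttemptEvent> events) {
        Queue<AttemptEvent> queue = new ArrayDeque<>(events);
        AttemptState s = AttemptState.COMMIT_PENDING;
        while (!queue.isEmpty()) {
            s = step(s, queue.poll());
        }
        return s;
    }

    public static void main(String[] args) {
        // TA_DONE queued ahead of TA_KILL: the kill is swallowed.
        System.out.println(drain(Arrays.asList(
            AttemptEvent.TA_DONE, AttemptEvent.TA_KILL, AttemptEvent.TA_CONTAINER_CLEANED)));
        // prints SUCCEEDED

        // TA_KILL queued first: the attempt ends up killed.
        System.out.println(drain(Arrays.asList(
            AttemptEvent.TA_KILL, AttemptEvent.TA_DONE, AttemptEvent.TA_CONTAINER_CLEANED)));
        // prints KILLED
    }
}
```

The same two events produce different terminal states depending only on their order in the queue, so a task that waits specifically for a "killed" report can be left waiting when the attempt reports success instead.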
                