Posted to mapreduce-issues@hadoop.apache.org by "Todd Lipcon (Created) (JIRA)" <ji...@apache.org> on 2011/10/25 18:16:32 UTC

[jira] [Created] (MAPREDUCE-3260) Yarn app stuck in KILL_WAIT state

Yarn app stuck in KILL_WAIT state
---------------------------------

                 Key: MAPREDUCE-3260
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3260
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2, resourcemanager
    Affects Versions: 0.23.0
            Reporter: Todd Lipcon
            Priority: Critical


Last night I killed an MR2 app using "hadoop job -kill". This morning I noticed it's still running, but in "KILL_WAIT" state with no tasks running.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3260) Yarn app stuck in KILL_WAIT state

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135232#comment-13135232 ] 

Todd Lipcon commented on MAPREDUCE-3260:
----------------------------------------

There seem to be some reducers stuck in KILLING state on some of the nodes. The only non-daemon thread is:

{code}
"main" prio=10 tid=0x0000000046f7b800 nid=0x3774 waiting on condition [0x000000004033e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:298)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:328)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:117)
{code}

Logs in the NM show the following, which looks like a race:
{code}
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319528200416_0004_01_002409 of type KILL_CONTAINER
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319528200416_0004_01_002409 transitioned from LOCALIZED to KILLING
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319528200416_0004_01_002409 of type CONTAINER_LAUNCHED
2011-10-25 00:44:42,860 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [KILLING], eventType: [CONTAINER_LAUNCHED]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_LAUNCHED at KILLING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:803)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:70)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:373)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:366)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2011-10-25 00:44:42,879 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319528200416_0004_01_002409 transitioned from KILLING to null
{code}
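
To make the suspected race concrete, here is a minimal, self-contained sketch of a table-driven state machine that reproduces the same ordering: KILL_CONTAINER is handled while the container is LOCALIZED, and the CONTAINER_LAUNCHED notification arrives afterwards, when the container is already in KILLING. This is not the NodeManager code and not the patch for this issue. The class name, the reduced set of states and events, and the `tolerateLateLaunch` flag are all hypothetical, and the self-transition it registers is only one illustrative way such a late event could be absorbed.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Hadoop's actual classes.
final class ContainerFsmSketch {
    enum State { LOCALIZED, RUNNING, KILLING, DONE }
    enum Event { CONTAINER_LAUNCHED, KILL_CONTAINER, CONTAINER_KILLED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.LOCALIZED;

    ContainerFsmSketch(boolean tolerateLateLaunch) {
        add(State.LOCALIZED, Event.CONTAINER_LAUNCHED, State.RUNNING);
        add(State.LOCALIZED, Event.KILL_CONTAINER, State.KILLING);
        add(State.RUNNING, Event.KILL_CONTAINER, State.KILLING);
        add(State.KILLING, Event.CONTAINER_KILLED, State.DONE);
        if (tolerateLateLaunch) {
            // Assumption for illustration: a launch notification that lost the
            // race with the kill request is absorbed and the state stays KILLING.
            add(State.KILLING, Event.CONTAINER_LAUNCHED, State.KILLING);
        }
    }

    private void add(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    // Applies one event; rejects any (state, event) pair that is not registered.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    State current() {
        return current;
    }

    public static void main(String[] args) {
        // Same event order as in the NM log: kill first, launch notification second.
        ContainerFsmSketch strict = new ContainerFsmSketch(false);
        strict.handle(Event.KILL_CONTAINER);
        try {
            strict.handle(Event.CONTAINER_LAUNCHED);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }

        ContainerFsmSketch tolerant = new ContainerFsmSketch(true);
        tolerant.handle(Event.KILL_CONTAINER);
        tolerant.handle(Event.CONTAINER_LAUNCHED); // absorbed, still KILLING
        tolerant.handle(Event.CONTAINER_KILLED);
        System.out.println("tolerant: " + tolerant.current());
    }
}
```

In the strict configuration the second event is rejected, which mirrors the "Invalid event: CONTAINER_LAUNCHED at KILLING" warning above. Whether the real container then ever receives the event that would move it out of KILLING is exactly what this issue needs to establish.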
                

        

[jira] [Resolved] (MAPREDUCE-3260) Yarn app stuck in KILL_WAIT state

Posted by "Hitesh Shah (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah resolved MAPREDUCE-3260.
------------------------------------

    Resolution: Duplicate

Duplicate of MAPREDUCE-3084, which is being fixed as part of MAPREDUCE-3240. I will push a patch shortly if you would like to take a look.
                