Posted to mapreduce-issues@hadoop.apache.org by "Chelsey Chang (JIRA)" <ji...@apache.org> on 2013/07/19 23:24:48 UTC

[jira] [Commented] (MAPREDUCE-5406) Task Tracker exiting with JVM manager inconsistent state

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714090#comment-13714090 ] 

Chelsey Chang commented on MAPREDUCE-5406:
------------------------------------------

Looking at the logs, it seems like "inconsistent state" means that the number of free slots (in TaskLauncher) was out of sync with the number of active JVMs (in JvmManager).
Task attempt_201306080400_104719_m_000000_0 was killed around 2013-06-09 06:39:55,472 and released a slot, but when the InconsistentStateException was thrown it was still running on one of the active JVMs. This is the only task (out of the 8 tasks in the error message) that looks suspicious. After some investigation and discussion with Ivan and Chuan, we think the inconsistent state may be caused by:
1. A race condition between a task finishing up and the same task being killed
2. The kill of the JVM not being executed properly
So I added some log messages that can help us identify where the inconsistency happens if we see this again.
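For illustration only, here is a minimal, hypothetical Java sketch (invented class and method names, not the actual TaskLauncher/JvmManager code) of how a kill that returns a slot without actually reaping the JVM leaves the two counters out of sync, and the kind of log line the added messages are meant to produce:
{code}
// Hypothetical sketch only -- names are invented and do not correspond
// to the real TaskTracker/JvmManager implementation.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SlotJvmRaceSketch {

    private static final int TOTAL_SLOTS = 2;

    // Stands in for the free-slot count kept by TaskLauncher.
    private final AtomicInteger freeSlots = new AtomicInteger(TOTAL_SLOTS);
    // Stands in for the JVM-id -> running-attempt map kept by JvmManager.
    private final Map<String, String> activeJvms = new ConcurrentHashMap<>();

    void launch(String jvmId, String attemptId) {
        freeSlots.decrementAndGet();
        activeJvms.put(jvmId, attemptId);
    }

    // Kill path: the slot is handed back immediately, but if the JVM kill
    // fails or is delayed (cause 2 above), the JVM remains "busy".
    void killTask(String jvmId) {
        freeSlots.incrementAndGet();
        if (tryKillJvm(jvmId)) {
            activeJvms.remove(jvmId);
        }
    }

    // The kind of diagnostic described above: compare the two counters when
    // reaping a JVM so the divergence point shows up in the log.
    void checkConsistency(String newAttemptId) {
        int busySlots = TOTAL_SLOTS - freeSlots.get();
        if (busySlots != activeJvms.size()) {
            System.err.printf(
                "Inconsistent state while reaping a JVM for %s: busy slots=%d, active JVMs=%d %s%n",
                newAttemptId, busySlots, activeJvms.size(), activeJvms.keySet());
        }
    }

    private boolean tryKillJvm(String jvmId) {
        return false; // simulate a kill that never takes effect
    }

    public static void main(String[] args) {
        SlotJvmRaceSketch tt = new SlotJvmRaceSketch();
        tt.launch("jvm_1", "attempt_1_m_000000_0");
        tt.killTask("jvm_1");                        // slot released, JVM never reaped
        tt.checkConsistency("attempt_2_m_000001_0"); // logs the mismatch
    }
}
{code}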
                
> Task Tracker exiting with JVM manager inconsistent state
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-5406
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5406
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Chelsey Chang
>            Assignee: Chelsey Chang
>         Attachments: hadoop-tasktracker-RD00155D61582F-short.log, MAPREDUCE-5406.branch-1-win.1.patch
>
>
> Looks like we are reaching a JVM manager inconsistent state which causes the TT to crash:
> {code}
> 2013-06-09 06:41:11,250 FATAL org.apache.hadoop.mapred.JvmManager: Inconsistent state!!! JVM Manager reached an unstable state while reaping a JVM for task: attempt_201306080400_104812_m_000001_0 Number of active JVMs:8
>   JVMId jvm_201306080400_104517_m_1331138312 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104517_m_000001_0
>   JVMId jvm_201306080400_104641_m_-1631395161 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104641_m_000000_0
>   JVMId jvm_201306080400_104494_m_-1702464703 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104494_m_000000_0
>   JVMId jvm_201306080400_104784_m_1407576088 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104784_m_000000_0
>   JVMId jvm_201306080400_104530_m_186665365 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104530_m_000000_0
>   JVMId jvm_201306080400_104589_m_-1080246077 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104589_m_000000_0
>   JVMId jvm_201306080400_104674_m_830017814 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104674_m_000000_0
>   JVMId jvm_201306080400_104719_m_-226910128 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104719_m_000000_0. Aborting. 
> 2013-06-09 06:41:11,250 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG: 
> {code}
> Although this causes the TT to crash, the error is rare and recoverable, so the priority of this issue is not high.
> However, this does look like a bug in the JVM manager state machine. My guess is that we're hitting some race condition.
> (Logs attached)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira