Posted to common-dev@hadoop.apache.org by "Ramya R (JIRA)" <ji...@apache.org> on 2009/05/27 13:20:45 UTC

[jira] Created: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

JT fails to recover the jobs after restart after HADOOP:4372
------------------------------------------------------------

                 Key: HADOOP-5924
                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Ramya R




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Ramya R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713538#action_12713538 ] 

Ramya R commented on HADOOP-5924:
---------------------------------

Submitted a job and restarted the JT after some time. Below is a snapshot of the JT log:

{noformat}
INFO org.apache.hadoop.mapred.JobTracker: Submitting job <jobID> on behalf of user <user> in groups :<group>
INFO org.apache.hadoop.mapred.JobHistory: Recovered job history filename for job <jobID> is <job history file>
INFO org.apache.hadoop.mapred.JobHistory:  <job history file> exists!
INFO org.apache.hadoop.mapred.JobHistory: <job history file> exists!
INFO org.apache.hadoop.mapred.JobQueuesManager: Job submitted to queue default
WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:<logs>history/<job history file>
Ignoring exception: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
        at org.apache.hadoop.mapred.JobHistory.parseHistoryFromFS(JobHistory.java:254)
        at org.apache.hadoop.mapred.JobTracker$RecoveryManager.recover(JobTracker.java:1361)
        at org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:1850)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3695)
INFO org.apache.hadoop.mapred.JobHistory: Deleting job history file <job history file>
INFO org.apache.hadoop.mapred.JobTracker: Restoration complete
INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id :<jobID> User:<user> Queue : default
INFO org.apache.hadoop.mapred.JobInitializationPoller: Initializing job : <jobID> in Queue default For user : <user>
INFO org.apache.hadoop.mapred.JobInProgress: Initializing <jobID>
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Job initialization failed:
java.lang.IllegalArgumentException: Can not create a Path from a null string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
        at org.apache.hadoop.fs.Path.<init>(Path.java:90)
        at org.apache.hadoop.fs.Path.<init>(Path.java:45)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.getJobHistoryLogLocation(JobHistory.java:577)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:871)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:405)
        at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.initializeJobs(JobInitializationPoller.java:143)
        at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.run(JobInitializationPoller.java:113)
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Removing killed/completed job from initalized jobs list : <jobID>
{noformat}

The job fails to recover and is marked as failed. This happens for all jobs, irrespective of map/reduce progress.
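
The ordering in the log suggests one possible reading: recovery deletes the history file it could not read, the job is then left without a history file name, and building a path from that missing name throws. The sketch below reproduces only that pattern in plain Java. It is not Hadoop code: SimplePath, register, discardUnreadable and historyLocation are hypothetical stand-ins, and the real cause is whatever the attached patches address.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: none of these types are Hadoop classes.
class NullHistoryPathSketch {

    // Hypothetical stand-in for a path type that rejects null input.
    static final class SimplePath {
        final String value;

        SimplePath(String value) {
            if (value == null) {
                throw new IllegalArgumentException("Can not create a Path from a null string");
            }
            this.value = value;
        }
    }

    // Hypothetical registry mapping a job ID to its history file name.
    private final Map<String, String> historyFiles = new HashMap<>();

    void register(String jobId, String fileName) {
        historyFiles.put(jobId, fileName);
    }

    // Models recovery discarding a history file it could not read.
    void discardUnreadable(String jobId) {
        historyFiles.remove(jobId);
    }

    // Throws IllegalArgumentException once the name has been discarded,
    // because Map.get returns null for a missing key.
    SimplePath historyLocation(String jobId) {
        return new SimplePath(historyFiles.get(jobId));
    }
}
```

Calling historyLocation for a job after discardUnreadable raises the same exception type and message as in the trace above, which is why a null check or a fallback name is needed before the path is built.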


[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714707#action_12714707 ] 

Amar Kamat commented on HADOOP-5924:
------------------------------------

ant test passed on my box.

> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
>                 Key: HADOOP-5924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Ramya R
>         Attachments: HADOOP-5923-v2.4.patch
>
>


[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714834#action_12714834 ] 

Devaraj Das commented on HADOOP-5924:
-------------------------------------

Minor nit - the check for an empty killList is redundant and can be removed.
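
As a generic illustration of why such a check is usually redundant (the patch code itself is not shown in this thread, so the shape below is an assumption): a for-each loop over an empty list simply does not execute, so guarding it with isEmpty() changes nothing. The class and method names are hypothetical.

```java
import java.util.List;

// Illustration only: hypothetical names, not the code from the patch.
class KillListSketch {

    // Version with the extra guard.
    static int processGuarded(List<String> killList) {
        int processed = 0;
        if (!killList.isEmpty()) {
            for (String jobId : killList) {
                processed++;
            }
        }
        return processed;
    }

    // Version without the guard: the loop body never runs for an empty list.
    static int process(List<String> killList) {
        int processed = 0;
        for (String jobId : killList) {
            processed++;
        }
        return processed;
    }
}
```

Both methods return the same count for any non-null list, including an empty one.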

[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5924:
-------------------------------

    Attachment: HADOOP-5924-v1.0.patch

Attaching a patch incorporating Devaraj's comments. Result of test-patch:
{code}
     [exec] +1 overall.
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

Note that the patch depends on HADOOP-5908. 

Ant tests passed on my box.

> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
>                 Key: HADOOP-5924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Ramya R
>         Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0.patch
>
>


[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-5924:
------------------------------------

    Attachment: H-5924.20.patch

Attached an alternate version for 0.20, not to be committed to the branch.

> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
>                 Key: HADOOP-5924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Ramya R
>            Assignee: Amar Kamat
>             Fix For: 0.20.1
>
>         Attachments: H-5924.20.patch, HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>


[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5924:
-------------------------------

    Attachment: HADOOP-5923-v2.4.patch

Attaching a patch that fixes the issue. Result of test-patch:
{code}
     [exec] +1 overall.
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

Running ant test now.

[jira] Resolved: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das resolved HADOOP-5924.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.20.1
         Assignee: Amar Kamat
     Hadoop Flags: [Reviewed]

I just committed this. Thanks, Amar!

> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
>                 Key: HADOOP-5924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Ramya R
>            Assignee: Amar Kamat
>             Fix For: 0.20.1
>
>         Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>


[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5924:
-------------------------------

    Attachment: HADOOP-5924-v1.0-branch20.patch

Attaching a patch for the 0.20 branch.

> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
>                 Key: HADOOP-5924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5924
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Ramya R
>         Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>


[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after restart after HADOOP:4372

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5924:
-------------------------------

    Release Note: Post HADOOP-4372, empty job history files caused NPE. This issue fixes that by creating new files if no old file is found.
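
A minimal sketch of the fallback the release note describes, assuming a simple in-memory model: reuse the recovered history file when it exists, otherwise create a new one. This is not the committed patch. The class, the method names and the "_history_new" naming scheme are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: an in-memory model, not Hadoop's JobHistory.
class HistoryFileFallbackSketch {

    // Stands in for the set of files present in the history directory.
    private final Set<String> files = new HashSet<>();

    void addExisting(String name) {
        files.add(name);
    }

    boolean exists(String name) {
        return files.contains(name);
    }

    // Returns the recovered file name if that file exists; otherwise
    // creates and returns a new file name for the job (hypothetical scheme).
    String openHistoryFile(String recoveredName, String jobId) {
        if (recoveredName != null && files.contains(recoveredName)) {
            return recoveredName;
        }
        String fresh = jobId + "_history_new";
        files.add(fresh);
        return fresh;
    }
}
```

With this shape the caller always gets a non-null name back, so later code that builds a path from it cannot hit the null case.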
