Posted to common-dev@hadoop.apache.org by "Ramya R (JIRA)" <ji...@apache.org> on 2009/05/27 13:20:45 UTC
[jira] Created: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
JT fails to recover the jobs after restart after HADOOP:4372
------------------------------------------------------------
Key: HADOOP-5924
URL: https://issues.apache.org/jira/browse/HADOOP-5924
Project: Hadoop Core
Issue Type: Bug
Reporter: Ramya R
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Ramya R (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713538#action_12713538 ]
Ramya R commented on HADOOP-5924:
---------------------------------
Submitted a job and restarted the JT after some time. Below is a snapshot of the JT log:
{noformat}
INFO org.apache.hadoop.mapred.JobTracker: Submitting job <jobID> on behalf of user <user> in groups :<group>
INFO org.apache.hadoop.mapred.JobHistory: Recovered job history filename for job <jobID> is <job history file>
INFO org.apache.hadoop.mapred.JobHistory: <job history file> exists!
INFO org.apache.hadoop.mapred.JobHistory: <job history file> exists!
INFO org.apache.hadoop.mapred.JobQueuesManager: Job submitted to queue default
WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:<logs>history/<job history file>
Ignoring exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
    at org.apache.hadoop.mapred.JobHistory.parseHistoryFromFS(JobHistory.java:254)
    at org.apache.hadoop.mapred.JobTracker$RecoveryManager.recover(JobTracker.java:1361)
    at org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:1850)
    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3695)
INFO org.apache.hadoop.mapred.JobHistory: Deleting job history file <job history file>
INFO org.apache.hadoop.mapred.JobTracker: Restoration complete
INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id :<jobID> User:<user> Queue : default
INFO org.apache.hadoop.mapred.JobInitializationPoller: Initializing job : <jobID> in Queue default For user : <user>
INFO org.apache.hadoop.mapred.JobInProgress: Initializing <jobID>
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Job initialization failed:
java.lang.IllegalArgumentException: Can not create a Path from a null string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
    at org.apache.hadoop.fs.Path.<init>(Path.java:90)
    at org.apache.hadoop.fs.Path.<init>(Path.java:45)
    at org.apache.hadoop.mapred.JobHistory$JobInfo.getJobHistoryLogLocation(JobHistory.java:577)
    at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:871)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:405)
    at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.initializeJobs(JobInitializationPoller.java:143)
    at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.run(JobInitializationPoller.java:113)
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Removing killed/completed job from initalized jobs list : <jobID>
{noformat}
The job fails to recover and is marked as failed. This happens for all jobs, irrespective of map/reduce progress.
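As a reading aid, the two exceptions in the log above can be reproduced in isolation. The sketch below is not Hadoop code: RecoveryFailureSketch, headerReadable, and the 8-byte header length are assumptions made for illustration, and checkPathArg is a stand-in that only mirrors the message shown in the trace. It shows that DataInputStream.readFully raises EOFException when the stream is empty, and that a null location string is rejected with IllegalArgumentException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

class RecoveryFailureSketch {

    // Stand-in for the argument check the trace attributes to Path.checkPathArg.
    static String checkPathArg(String path) {
        if (path == null) {
            throw new IllegalArgumentException("Can not create a Path from a null string");
        }
        return path;
    }

    // Tries to read a fixed-size header; an empty stream cannot supply it,
    // so readFully throws EOFException and this method returns false.
    static boolean headerReadable(byte[] fileContents, int headerLength) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileContents))) {
            in.readFully(new byte[headerLength]);
            return true;
        } catch (EOFException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical 8-byte header, empty file contents.
        System.out.println("empty file header readable: " + headerReadable(new byte[0], 8));

        String historyLocation = null; // assumption: no location is known for the job
        try {
            checkPathArg(historyLocation);
        } catch (IllegalArgumentException e) {
            System.out.println("init failed: " + e.getMessage());
        }
    }
}
```

Whether the null location in the real JT comes from the deleted history file is an inference from the log order, not something the log states directly.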
[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714707#action_12714707 ]
Amar Kamat commented on HADOOP-5924:
------------------------------------
ant test passed on my box.
> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
> Key: HADOOP-5924
> URL: https://issues.apache.org/jira/browse/HADOOP-5924
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Ramya R
> Attachments: HADOOP-5923-v2.4.patch
>
>
[jira] Commented: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714834#action_12714834 ]
Devaraj Das commented on HADOOP-5924:
-------------------------------------
Minor nit - the check for an empty killList is redundant and can be removed.
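The patch itself is not quoted in this thread, so the following is only a generic sketch of why such a check is redundant. It assumes killList is a java.util.List that is iterated with a for-each loop; KillListSketch and killAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class KillListSketch {

    // Hypothetical per-entry action: here the "kill" is just recorded.
    static List<String> killAll(List<String> killList) {
        List<String> killed = new ArrayList<>();
        // No isEmpty() guard is needed: for an empty list the loop body never runs.
        for (String jobId : killList) {
            killed.add(jobId);
        }
        return killed;
    }

    public static void main(String[] args) {
        System.out.println(killAll(Collections.<String>emptyList()).size());
        System.out.println(killAll(Arrays.asList("jobA", "jobB")).size());
    }
}
```

A guard such as `if (!killList.isEmpty())` around the loop would produce the same result in both calls.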
[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amar Kamat updated HADOOP-5924:
-------------------------------
Attachment: HADOOP-5924-v1.0.patch
Attaching a patch incorporating Devaraj's comments. Result of test-patch:
{code}
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
{code}
Note that the patch depends on HADOOP-5908.
Ant tests passed on my box.
> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
> Key: HADOOP-5924
> URL: https://issues.apache.org/jira/browse/HADOOP-5924
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Ramya R
> Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0.patch
>
>
[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Chansler updated HADOOP-5924:
------------------------------------
Attachment: H-5924.20.patch
Attached an alternate version for 0.20, not to be committed to the branch.
> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
> Key: HADOOP-5924
> URL: https://issues.apache.org/jira/browse/HADOOP-5924
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Ramya R
> Assignee: Amar Kamat
> Fix For: 0.20.1
>
> Attachments: H-5924.20.patch, HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>
[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amar Kamat updated HADOOP-5924:
-------------------------------
Attachment: HADOOP-5923-v2.4.patch
Attaching a patch that fixes the issue. Result of test-patch:
{code}
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
{code}
Running ant test now.
[jira] Resolved: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das resolved HADOOP-5924.
---------------------------------
Resolution: Fixed
Fix Version/s: 0.20.1
Assignee: Amar Kamat
Hadoop Flags: [Reviewed]
I just committed this. Thanks, Amar!
> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
> Key: HADOOP-5924
> URL: https://issues.apache.org/jira/browse/HADOOP-5924
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Ramya R
> Assignee: Amar Kamat
> Fix For: 0.20.1
>
> Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>
[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amar Kamat updated HADOOP-5924:
-------------------------------
Attachment: HADOOP-5924-v1.0-branch20.patch
Attaching a patch for 0.20 branch.
> JT fails to recover the jobs after restart after HADOOP:4372
> ------------------------------------------------------------
>
> Key: HADOOP-5924
> URL: https://issues.apache.org/jira/browse/HADOOP-5924
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Ramya R
> Attachments: HADOOP-5923-v2.4.patch, HADOOP-5924-v1.0-branch20.patch, HADOOP-5924-v1.0.patch
>
>
[jira] Updated: (HADOOP-5924) JT fails to recover the jobs after
restart after HADOOP:4372
Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amar Kamat updated HADOOP-5924:
-------------------------------
Release Note: Post HADOOP-4372, empty job history files caused an NPE. This issue fixes that by creating new files if no old file is found.
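The release note describes the fix only in one sentence. A minimal sketch of that idea, using java.nio.file rather than Hadoop's FileSystem API, might look like the following. It is not the committed patch: HistoryFileFallbackSketch, openOrCreate, newTempDir, the ".history" suffix, and the job id "job_0001" are all hypothetical, and treating an empty old file as "not found" is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class HistoryFileFallbackSketch {

    // Returns the recovered file when it is usable; otherwise creates a new, empty one.
    static Path openOrCreate(Path recovered, Path historyDir, String jobId) {
        try {
            if (recovered != null && Files.exists(recovered) && Files.size(recovered) > 0) {
                return recovered;
            }
            Files.createDirectories(historyDir);
            Path fresh = historyDir.resolve(jobId + ".history");
            // Files.write creates the file, or truncates it if it already exists.
            Files.write(fresh, new byte[0]);
            return fresh;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: a scratch directory standing in for the history directory.
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("history");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        // No old file is known (null), so a new one is created instead of failing.
        Path file = openOrCreate(null, dir, "job_0001");
        System.out.println("using history file: " + file.getFileName());
    }
}
```

The point of the design is that the caller always receives a non-null location, so later code that builds a path from it cannot hit the null-string check.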