Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2008/10/06 18:27:47 UTC

[jira] Created: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

a job stays in running state forever, even though all the tasks completed a long time ago
-----------------------------------------------------------------------------------------

                 Key: HADOOP-4352
                 URL: https://issues.apache.org/jira/browse/HADOOP-4352
             Project: Hadoop Core
          Issue Type: Bug
    Affects Versions: 0.17.2
            Reporter: Runping Qi



I encountered a job that stays in the running state forever, even though all of its tasks completed a long time ago.
The last lines in the JobTracker log complain that it cannot connect to the DFS namenode, although the namenode is working fine at present.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637994#action_12637994 ] 

Runping Qi commented on HADOOP-4352:
------------------------------------


The problem may be due to the following exception logged in the JT log:

2008-09-09 04:06:30,968 ERROR org.apache.hadoop.mapred.JobTracker: Task Commit Thread got an exception:
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:339)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:100)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:263)
        at sun.nio.cs.StreamEncoder.write(



[jira] Issue Comment Edited: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637356#action_12637356 ] 

amareshwari edited comment on HADOOP-4352 at 10/6/08 8:49 PM:
--------------------------------------------------------------------------

Runping, can you paste the last lines from the JT log (including the last task) where it complains that it cannot connect to the namenode? Is it complaining in garbageCollect()?

      was (Author: amareshwari):
    Runping, Can you paste the lines from the JT log where it is complaining that it cannot to namenode.  Is it in garbageCollect()?
  

[jira] Commented: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637839#action_12637839 ] 

Devaraj Das commented on HADOOP-4352:
-------------------------------------

I looked offline at a JT that was in this state. It looks like an FSError thrown while updating the job history file during heartbeat processing prevented the JT from processing the heartbeat completely, which left it in this inconsistent state. Maybe we should move to a model where we queue up everything to be written to disk/DFS (mainly when a task is launched/succeeded/failed/killed) and let the JT deal only with in-memory data structures during heartbeat processing. That would also make heartbeat processing faster.
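
A minimal sketch of the queued-write model described above, in plain Java with no Hadoop dependencies. All names here (QueuedHistoryWriter, HistorySink, enqueue, shutdownAndWait) are hypothetical and do not come from the JobTracker code. The sketch assumes a failed write should be recorded and skipped rather than propagated, and it only handles IOException; how a real FSError should be treated is left open.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from the Hadoop code base.
public class QueuedHistoryWriter {

    // Stand-in for the job history file on local disk or DFS.
    public interface HistorySink {
        void write(String event) throws IOException;
    }

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> failed = Collections.synchronizedList(new ArrayList<>());
    private final HistorySink sink;
    private final Thread writer;
    private volatile boolean running = true;

    public QueuedHistoryWriter(HistorySink sink) {
        this.sink = sink;
        this.writer = new Thread(this::drain, "history-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    // Called during heartbeat processing: only touches memory, never does I/O.
    public void enqueue(String event) {
        pending.add(event);
    }

    // Background loop: a failed write is recorded and the loop moves on,
    // so one bad write cannot leave the in-memory job state half-updated.
    private void drain() {
        while (running || !pending.isEmpty()) {
            String event;
            try {
                event = pending.poll(50, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (event == null) {
                continue;
            }
            try {
                sink.write(event);
            } catch (IOException e) {
                failed.add(event);
            }
        }
    }

    // Signals the writer to stop once the queue is empty, then waits for it.
    public void shutdownAndWait() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Snapshot of the events whose write failed.
    public List<String> failedEvents() {
        synchronized (failed) {
            return new ArrayList<>(failed);
        }
    }
}
```

With this shape, the heartbeat path would call enqueue() after updating its in-memory structures, and a "No space left on device" failure would surface in failedEvents() instead of aborting heartbeat processing.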


[jira] Commented: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637356#action_12637356 ] 

Amareshwari Sriramadasu commented on HADOOP-4352:
-------------------------------------------------

Runping, can you paste the lines from the JT log where it is complaining that it cannot connect to the namenode? Is it in garbageCollect()?


[jira] Updated: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated HADOOP-4352:
-------------------------------

    Attachment: jobtracker_jstatck_trace.out

Java stack trace for the JobTracker process.


[jira] Commented: (HADOOP-4352) a job stays in running state forever, even though all the tasks completed a long time ago

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637922#action_12637922 ] 

Vinod K V commented on HADOOP-4352:
-----------------------------------

We can queue DFS writes, but we might still want disk writes to happen inline so that JobHistory is immediately available for JobRecovery. Or, at a minimum, we should give disk writes higher priority when queuing JobHistory writes.
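
One way to picture the "higher priority to disk writes" option is a priority queue that orders pending history writes by target first and by arrival order second. This is only an illustrative sketch in plain Java: PrioritizedHistoryQueue, Target, add and drainInOrder are hypothetical names, and the real JobHistory code is not structured this way as far as this thread shows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these names come from the Hadoop code base.
public class PrioritizedHistoryQueue {

    // Declaration order doubles as priority: LOCAL_DISK sorts before DFS.
    public enum Target { LOCAL_DISK, DFS }

    private static final class Write {
        final Target target;
        final long seq;
        final String event;

        Write(Target target, long seq, String event) {
            this.target = target;
            this.seq = seq;
            this.event = event;
        }
    }

    private final AtomicLong nextSeq = new AtomicLong();

    // Order by target, then by sequence number so that writes to the same
    // target keep their arrival order.
    private final PriorityBlockingQueue<Write> queue = new PriorityBlockingQueue<>(
            16,
            Comparator.comparing((Write w) -> w.target).thenComparingLong(w -> w.seq));

    public void add(Target target, String event) {
        queue.add(new Write(target, nextSeq.getAndIncrement(), event));
    }

    // Returns the queued writes in the order a writer thread would take them.
    public List<String> drainInOrder() {
        List<String> out = new ArrayList<>();
        Write w;
        while ((w = queue.poll()) != null) {
            out.add(w.target + ":" + w.event);
        }
        return out;
    }
}
```

The sequence number matters because a priority queue alone does not preserve insertion order among equal-priority elements; without it, two disk writes for the same job could be reordered.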
