Posted to common-dev@hadoop.apache.org by "Yoram Arnon (JIRA)" <ji...@apache.org> on 2006/09/05 17:26:22 UTC

[jira] Created: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

job tracker hangs on to dead task trackers "forever"
----------------------------------------------------

                 Key: HADOOP-506
                 URL: http://issues.apache.org/jira/browse/HADOOP-506
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Yoram Arnon
            Priority: Minor


I see cases where a task tracker gets disconnected from the job tracker and reconnects, and then appears twice in the job tracker's list, with one instance alive and well and the other's 'time since last heartbeat' increasing monotonically.
That all makes sense.
What doesn't make sense is that the old instances never expire. It's been over 400000 seconds since the last heartbeat, and the cluster reports more nodes up and running than its size (350 nodes in a 320-node cluster).

There should be some reasonable timeout for these expired task trackers, somewhere between 10 minutes and an hour.
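
The timeout requested above could look roughly like the following sketch. It is not the job tracker's actual code: the class name, the method names, and the 10-minute interval (the low end of the suggested range) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: drop trackers whose last heartbeat is older than a timeout. */
final class TrackerExpirySketch {
    // Assumed value: 10 minutes, the low end of the range suggested in the report.
    static final long EXPIRY_INTERVAL_MS = 10L * 60L * 1000L;

    // Tracker name -> time of the last heartbeat, in milliseconds.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    void heartbeat(String trackerName, long nowMs) {
        lastHeartbeat.put(trackerName, nowMs);
    }

    /** Removes and returns every tracker that has been silent longer than the timeout. */
    List<String> expireStale(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > EXPIRY_INTERVAL_MS) {
                expired.add(entry.getKey());
                it.remove(); // safe removal while iterating
            }
        }
        return expired;
    }

    int liveCount() {
        return lastHeartbeat.size();
    }

    public static void main(String[] args) {
        TrackerExpirySketch sketch = new TrackerExpirySketch();
        long later = 400_000L * 1000L; // 400000 seconds, as in the report
        sketch.heartbeat("tracker_old", 0L);
        sketch.heartbeat("tracker_new", later);
        System.out.println(sketch.expireStale(later) + " expired, "
                + sketch.liveCount() + " live");
    }
}
```

With a sweep like this running periodically, a stale instance would stop being counted once it passes the interval, so the node count could not stay above the cluster size.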

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Doug Cutting updated HADOOP-506:
--------------------------------

    Fix Version/s: 0.8.0
                       (was: 0.7.0)



        

[jira] Updated: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Sanjay Dahiya updated HADOOP-506:
---------------------------------

           Status: Patch Available  (was: In Progress)
    Fix Version/s: 0.7.0



        

[jira] Assigned: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Sanjay Dahiya reassigned HADOOP-506:
------------------------------------

    Assignee: Sanjay Dahiya



        

[jira] Commented: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-506?page=comments#action_12440217 ] 
            
Sanjay Dahiya commented on HADOOP-506:
--------------------------------------

One case in which I am able to reproduce this repeatedly is when the job tracker restarts while task trackers are still running. Basically, UNKNOWN_TASKTRACKER status messages are not handled properly in the job tracker.

Here is what happens in that case:

synchronized (taskTrackers) {
    synchronized (trackerExpiryQueue) {
        boolean seenBefore = updateTaskTrackerStatus(trackerName,
                                                     trackerStatus);
        if (initialContact) {    // <<<<==== this is false
            // If it's first contact, then clear out any state hanging around
            if (seenBefore) {
                lostTaskTracker(trackerName, trackerStatus.getHost());
            }
        } else {
            // If not first contact, there should be some record of the tracker
            if (!seenBefore) {
                // <<<<==== returns this, but the TT is already in taskTrackers and not in the expiry queue
                return InterTrackerProtocol.UNKNOWN_TASKTRACKER;
            }
        }

        if (initialContact) {
            trackerExpiryQueue.add(trackerStatus);    // <<<<==== not called
        }
    }
}


In updateTaskTrackerStatus, if (oldStatus == null && initialContact == false) then it's a rogue status and should not be added to the taskTrackers map in the job tracker.

I am investigating whether this can happen under other conditions as well.
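
To make the leak concrete, here is a small self-contained model of the flow described above. It is only a sketch: the class, the numeric response codes, the expiry interval, and the tracker names are hypothetical, and it assumes the restarted task tracker re-registers under a new name. It also leaves out the lostTaskTracker branch, which does not affect this path.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the heartbeat path that leaks a tracker entry. */
final class HeartbeatLeakSketch {
    static final int TRACKERS_OK = 0;          // assumed response codes
    static final int UNKNOWN_TASKTRACKER = 1;
    static final long EXPIRY_INTERVAL_MS = 10L * 60L * 1000L; // assumed timeout

    final Map<String, Long> taskTrackers = new HashMap<>();
    final Set<String> trackerExpiryQueue = new LinkedHashSet<>();

    int heartbeat(String name, long nowMs, boolean initialContact) {
        // As in the quoted flow, the status is recorded before any checks run.
        boolean seenBefore = taskTrackers.put(name, nowMs) != null;
        if (!initialContact && !seenBefore) {
            // The entry stays in taskTrackers but never reaches the expiry queue.
            return UNKNOWN_TASKTRACKER;
        }
        if (initialContact) {
            trackerExpiryQueue.add(name);
        }
        return TRACKERS_OK;
    }

    /** Expires only the trackers that are reachable through the expiry queue. */
    void expireTrackers(long nowMs) {
        Iterator<String> it = trackerExpiryQueue.iterator();
        while (it.hasNext()) {
            String name = it.next();
            Long last = taskTrackers.get(name);
            if (last == null || nowMs - last > EXPIRY_INTERVAL_MS) {
                taskTrackers.remove(name);
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        HeartbeatLeakSketch jobTracker = new HeartbeatLeakSketch(); // freshly restarted
        long later = 400_000L * 1000L;
        // A tracker that was already running sends a regular heartbeat.
        int response = jobTracker.heartbeat("tracker_host1_old", 0L, false);
        // It then re-registers under a new name and keeps heartbeating.
        jobTracker.heartbeat("tracker_host1_new", 1_000L, true);
        jobTracker.heartbeat("tracker_host1_new", later, false);
        jobTracker.expireTrackers(later);
        System.out.println("first response unknown: " + (response == UNKNOWN_TASKTRACKER));
        System.out.println("trackers still listed: " + jobTracker.taskTrackers.keySet());
    }
}
```

In this model the old name survives every sweep because the sweep only walks the expiry queue, which matches the report of one host appearing twice with an ever-growing time since last heartbeat.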




        

[jira] Commented: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-506?page=comments#action_12440267 ] 
            
Doug Cutting commented on HADOOP-506:
-------------------------------------

I just committed this.  Thanks, Sanjay!



        

[jira] Updated: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Sanjay Dahiya updated HADOOP-506:
---------------------------------

    Attachment: Hadoop-506.patch

This patch removes the task tracker from the map maintained by JobTracker if it is an UNKNOWN_TASKTRACKER tracker.
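
The behavior described here might be sketched as follows. This is not the contents of Hadoop-506.patch: the class, the numeric response codes, and the simplified heartbeat signature are assumptions used only to illustrate removing the entry before UNKNOWN_TASKTRACKER is returned.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: forget a tracker that is answered with UNKNOWN_TASKTRACKER. */
final class UnknownTrackerRemovalSketch {
    static final int TRACKERS_OK = 0;          // assumed response codes
    static final int UNKNOWN_TASKTRACKER = 1;

    final Map<String, Long> taskTrackers = new HashMap<>();
    final Set<String> trackerExpiryQueue = new LinkedHashSet<>();

    int heartbeat(String name, long nowMs, boolean initialContact) {
        boolean seenBefore = taskTrackers.put(name, nowMs) != null;
        if (!initialContact && !seenBefore) {
            // Undo the update so that no entry exists outside the expiry queue.
            taskTrackers.remove(name);
            return UNKNOWN_TASKTRACKER;
        }
        if (initialContact) {
            trackerExpiryQueue.add(name);
        }
        return TRACKERS_OK;
    }

    public static void main(String[] args) {
        UnknownTrackerRemovalSketch jobTracker = new UnknownTrackerRemovalSketch();
        jobTracker.heartbeat("tracker_host1_old", 0L, false);  // rejected and forgotten
        jobTracker.heartbeat("tracker_host1_new", 1_000L, true);
        System.out.println("trackers listed: " + jobTracker.taskTrackers.keySet());
    }
}
```

Under these assumptions every name in the map is also in the expiry queue, so a tracker that later goes silent can still be timed out.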




        

[jira] Updated: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Owen O'Malley updated HADOOP-506:
---------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.7.0
                       (was: 0.8.0)
       Resolution: Fixed



        

[jira] Work started: (HADOOP-506) job tracker hangs on to dead task trackers "forever"

Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-506?page=all ]

Work on HADOOP-506 started by Sanjay Dahiya.

