Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2007/05/04 19:06:19 UTC

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493745 ] 

dhruba borthakur commented on HADOOP-1255:
------------------------------------------

+1. Code looks good. Let's get this into the next release as soon as possible.

> Name-node falls into infinite loop trying to remove a dead node.
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1255
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1255
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Konstantin Shvachko
>         Assigned To: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.13.0
>
>         Attachments: heartbeat.patch, heartbeat.patch
>
>
> Under certain conditions the name-node falls into an infinite loop in heartbeatCheck().
> It's rather hard to reproduce. I'm running a one-node cluster: 1 name-node, 1 data-node.
> The data-node dies, and 10 minutes later I get:
> 07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
> 07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
> ...................................................
> 07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
> 07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
> Here is what I see in the debugger:
> FSNamesystem.heartbeats contains 2 identical (same instance) DatanodeDescriptor entries, both of which have
> DatanodeDescriptor.isAlive = false. heartbeatCheck() correctly detects that there is a dead node in
> the list, but removeDatanode() does not delete the node from heartbeats because it is already marked dead.
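
The failure mode described above can be sketched in a few lines of Java. This is a
simplified model, not the Hadoop source: the class HeartbeatLoopSketch, the Node type,
the heartbeatExpired flag, removeIfAlive() and the maxPasses cap are all hypothetical
names introduced for illustration. It assumes only what the report states: the same
instance appears twice in the list, isAlive is false, and removal is skipped for a
node that is not alive.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatLoopSketch {
    /** Hypothetical stand-in for DatanodeDescriptor; only two flags are modeled. */
    static class Node {
        boolean heartbeatExpired; // assumption: how the check decides a node is dead
        boolean isAlive;          // mirrors DatanodeDescriptor.isAlive from the report

        Node(boolean heartbeatExpired, boolean isAlive) {
            this.heartbeatExpired = heartbeatExpired;
            this.isAlive = isAlive;
        }
    }

    /** Removes the node only if it is still marked alive, as the report describes. */
    static void removeIfAlive(List<Node> heartbeats, Node node) {
        if (node.isAlive) {
            heartbeats.remove(node); // removes the first matching entry only
            node.isAlive = false;
        }
    }

    /**
     * Repeats "find an expired node, try to remove it" until none is found.
     * maxPasses is a safety cap added for this sketch so the demo terminates.
     * Returns the number of passes that found an expired node.
     */
    static int heartbeatCheck(List<Node> heartbeats, int maxPasses) {
        int passes = 0;
        while (passes < maxPasses) {
            Node dead = null;
            for (Node n : heartbeats) {
                if (n.heartbeatExpired) {
                    dead = n;
                    break;
                }
            }
            if (dead == null) {
                break; // nothing left to remove
            }
            removeIfAlive(heartbeats, dead);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        Node node = new Node(true, false); // expired and already marked dead
        List<Node> heartbeats = new ArrayList<>();
        heartbeats.add(node);
        heartbeats.add(node); // same instance listed twice, as seen in the debugger

        int passes = heartbeatCheck(heartbeats, 1000);
        System.out.println("passes=" + passes + ", entries left=" + heartbeats.size());
    }
}
```

With the cap set to 1000 the sketch prints "passes=1000, entries left=2": every pass
finds the same entry and the guarded removal never shrinks the list. Without the cap
the loop would not end, which matches the repeating log lines quoted above.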
