Posted to common-dev@hadoop.apache.org by "Konstantin Shvachko (JIRA)" <ji...@apache.org> on 2007/04/12 20:11:15 UTC

[jira] Created: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Name-node falls into infinite loop trying to remove a dead node.
----------------------------------------------------------------

                 Key: HADOOP-1255
                 URL: https://issues.apache.org/jira/browse/HADOOP-1255
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.12.3
            Reporter: Konstantin Shvachko
             Fix For: 0.13.0


Under certain conditions the name-node falls into an infinite loop in heartbeatCheck().
It's rather hard to reproduce. I'm running a one-node cluster: 1 name-node, 1 data-node.
The data-node dies, and 10 minutes later I get

07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
...................................................
07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077

Here is what I see in the debugger:
FSNamesystem.heartbeats contains 2 identical (same instance) DatanodeDescriptor entries, both have
DatanodeDescriptor.isAlive = false. The heartbeatCheck() correctly detects that there is a dead node in
the list, but removeDatanode() does not delete the node from the heartbeats because it is dead.
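
The mechanism described above can be illustrated with a minimal sketch. This is a
simplified model, not the actual FSNamesystem code: the Node class, the heartbeatExpired
flag, passesUntilClean() and the pass limit are hypothetical names introduced only for
illustration, and the real loop would have no pass limit.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatLoopSketch {
    // Hypothetical stand-in for DatanodeDescriptor; only two flags are modelled.
    public static class Node {
        public boolean isAlive = true;
        public boolean heartbeatExpired = false;
    }

    // Simplified model of removeDatanode(): it acts only on nodes still marked alive.
    public static void removeDatanode(List<Node> heartbeats, Node node) {
        if (node.isAlive) {
            heartbeats.remove(node); // ArrayList.remove(Object) drops only the first match
            node.isAlive = false;
        }
    }

    // Runs a heartbeatCheck()-style loop and returns the number of passes made
    // before no expired node is left in the list, giving up after maxPasses.
    public static int passesUntilClean(List<Node> heartbeats, int maxPasses) {
        int passes = 0;
        while (passes < maxPasses) {
            Node dead = null;
            for (Node n : heartbeats) {
                if (n.heartbeatExpired) {
                    dead = n;
                    break;
                }
            }
            if (dead == null) {
                return passes; // the list is clean, so the loop terminates
            }
            removeDatanode(heartbeats, dead);
            passes++;
        }
        return passes; // limit reached: the real loop would spin forever here
    }

    public static void main(String[] args) {
        Node node = new Node();
        node.heartbeatExpired = true;

        // The same instance is queued twice, as seen in the debugger.
        List<Node> heartbeats = new ArrayList<>();
        heartbeats.add(node);
        heartbeats.add(node);

        System.out.println(passesUntilClean(heartbeats, 100)); // prints 100
        System.out.println(heartbeats.size());                 // prints 1
    }
}
```

In this model the first pass removes one reference and marks the node dead; every later
pass finds the second reference, but the isAlive guard turns the removal into a no-op.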


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493745 ] 

dhruba borthakur commented on HADOOP-1255:
------------------------------------------

+1. Code looks good. Let's get this into the next release as soon as possible.

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488830 ] 

Hairong Kuang commented on HADOOP-1255:
---------------------------------------

Konstantin, this is interesting! I will take a look at how HADOOP-1256 causes the infinite loop after I come back from my vacation.

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490475 ] 

Christian Kunz commented on HADOOP-1255:
----------------------------------------

Just for the record, our namenode servers with release 0.12.3 got into this situation twice, once with a 1000-node cluster, once with a 500-node cluster. In this situation the server spits out 300+ messages per sec and becomes rather unresponsive to DFS clients.

[jira] Updated: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1255:
----------------------------------

    Attachment: heartbeat.patch

[jira] Updated: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1255:
----------------------------------

    Status: Patch Available  (was: Open)

Many thanks to Konstantin, who spent time reproducing the infinite loop problem that he hit some time ago. I looked at his case and found that the loop was caused by HADOOP-1256. Since HADOOP-1256 already took care of his problem, I am going to mark this patch available.

[jira] Updated: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated HADOOP-1255:
---------------------------------

    Priority: Blocker  (was: Major)

Whenever this happens, we have to restart the dfs.

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494363 ] 

Hadoop QA commented on HADOOP-1255:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356681/heartbeat.patch applied and successfully tested against trunk revision r536239.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/123/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/123/console

[jira] Updated: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1255:
----------------------------------

    Attachment: heartbeat.patch

Update the patch to the latest trunk.

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494480 ] 

Hadoop QA commented on HADOOP-1255:
-----------------------------------

Integrated in Hadoop-Nightly #83 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/83/)

[jira] Updated: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1255:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Hairong!

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488484 ] 

Hairong Kuang commented on HADOOP-1255:
---------------------------------------

After much investigation, I was able to reproduce the problem. It is caused by the same datanode registering more than once. Each registration puts the DatanodeDescriptor in the heartbeat queue. When the heartbeat queue has more than one reference to the same DatanodeDescriptor and the datanode loses a heartbeat, heartbeatCheck gets into an infinite loop.

This problem could be fixed either by doing a contains check before adding a DatanodeDescriptor to the heartbeat queue or by using a collection type that disallows duplicate entries for the heartbeat queue.
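
The two options can be sketched roughly as follows. This is only an illustration under
stated assumptions, not the contents of heartbeat.patch: the Node class and the
addIfAbsent() and addToSet() helpers are hypothetical names, and the real heartbeat
queue may need properties (ordering, synchronization) that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HeartbeatDedupSketch {
    // Hypothetical stand-in for DatanodeDescriptor. equals() is not overridden,
    // so two references compare equal only when they are the same instance.
    public static class Node {
        public final String name;

        public Node(String name) {
            this.name = name;
        }
    }

    // Option 1: keep the list, but do a contains check before adding.
    public static boolean addIfAbsent(List<Node> heartbeats, Node node) {
        if (heartbeats.contains(node)) {
            return false; // already queued, e.g. by an earlier registration
        }
        return heartbeats.add(node);
    }

    // Option 2: use a collection type that rejects duplicates by itself.
    public static boolean addToSet(Set<Node> heartbeats, Node node) {
        return heartbeats.add(node); // Set.add returns false for a duplicate
    }

    public static void main(String[] args) {
        Node node = new Node("0.0.0.0:50077");

        List<Node> list = new ArrayList<>();
        addIfAbsent(list, node);
        addIfAbsent(list, node); // second registration is ignored
        System.out.println(list.size()); // prints 1

        Set<Node> set = new LinkedHashSet<>();
        addToSet(set, node);
        addToSet(set, node); // second registration is ignored
        System.out.println(set.size()); // prints 1
    }
}
```

The contains check is a linear scan on a list, while a hash-based set makes the same
check in roughly constant time; which trade-off suits the name-node is not settled here.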

[jira] Commented: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488809 ] 

Konstantin Shvachko commented on HADOOP-1255:
---------------------------------------------

I still get an infinite loop with this patch.
But HADOOP-1256 fixes it.


[jira] Assigned: (HADOOP-1255) Name-node falls into infinite loop trying to remove a dead node.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-1255:
-------------------------------------

    Assignee: Hairong Kuang
