Posted to common-user@hadoop.apache.org by Tim Nelson <ha...@enigmasecurity.com> on 2008/03/11 04:37:36 UTC

zombie data nodes, not alive but not dead

I've got to be doing something stupid, because I can't find any mention
of others having this problem. Here's what's happening. I had a cluster
of nine nodes (1 namenode and 8 datanodes) running the 0.15.3 release.
I've been running the mapred samples and reformatting filesystems, just
getting comfortable with the software. When I upgraded to the 0.16.0
release I reformatted (mke2fs) all of my data partitions (including the
namenode data) and ran "hadoop namenode -format", which completed fine.
When I brought the nodes back up, the only slave to connect to the
master was the master itself, acting as a datanode. The dfs daemon
starts on the slave nodes, but it never seems to connect to the master.
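
For reference, here's roughly the sequence I followed (a sketch;
/dev/sda1 is my shorthand for whatever device sits behind the /mnt/sda1
mount in my config below, and I'm glossing over the mount details):

    # from the master: stop all daemons
    $ bin/stop-all.sh
    # on each node: rebuild the data partition
    $ mke2fs /dev/sda1
    $ mount /dev/sda1 /mnt/sda1
    # from the master: reinitialize the HDFS namespace and restart
    $ bin/hadoop namenode -format
    $ bin/start-all.sh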

I know that the slaves are doing *something* with the master, because
if I start them before the namenode is running I get lots of log
messages about attempting to reconnect. Below are my site config and
logs from the namenode and from a "zombie" datanode.

**** hadoop-site.xml (same across all nodes) ****
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/sda1/hadoop-datastore-0.15.3/hadoop-${user.name}</value>
  <description>...</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://head00:54310</value>
  <description>...</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>head00:54311</value>
  <description>...</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>...</description>
</property>
</configuration>


**** namenode log ****
2008-03-10 19:32:53,186 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = head00/192.168.16.48
STARTUP_MSG:   args = []
************************************************************/
2008-03-10 19:32:54,260 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: head00/192.168.16.48:54310
2008-03-10 19:32:54,267 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-03-10 19:32:54,628 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2008-03-10 19:32:54,629 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2008-03-10 19:32:55,421 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-03-10 19:32:56,048 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-03-10 19:32:56,051 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-03-10 19:32:56,052 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-03-10 19:32:57,493 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@30d082
2008-03-10 19:32:57,826 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-03-10 19:32:58,112 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50070
2008-03-10 19:32:58,112 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@18e2b22
2008-03-10 19:32:58,112 INFO org.apache.hadoop.fs.FSNamesystem: Web-server up at: 50070
2008-03-10 19:32:58,116 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting
2008-03-10 19:32:58,139 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting
2008-03-10 19:32:58,140 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310: starting
...
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 54310: starting
2008-03-10 19:33:02,684 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.16.48:50010 storage DS-437400207-192.168.16.48-50010-1205199182672   ***** this is the namenode connecting to itself as a data node *****
2008-03-10 19:33:02,693 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.16.48:50010
2008-03-10 19:38:01,248 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48
2008-03-10 19:38:01,249 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0
2008-03-10 19:43:02,227 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48
2008-03-10 19:48:02,374 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48

**** datanode log ****
2008-03-10 20:30:31,392 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = node05/192.168.16.55
STARTUP_MSG:   args = []
************************************************************/
2008-03-10 20:30:31,786 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-10 20:30:32,000 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 1 time(s).
2008-03-10 20:30:33,006 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 2 time(s).
...
2008-03-10 20:31:30,244 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 4 time(s).
2008-03-10 20:31:31,250 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 5 time(s).
2008-03-10 20:31:36,130 INFO org.apache.hadoop.dfs.Storage: Storage directory /mnt/sda1/hadoop-datastore-0.15.3/hadoop-hadoop/dfs/data is not formatted.
2008-03-10 20:31:36,131 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-03-10 20:31:36,196 INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
2008-03-10 20:31:36,419 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-03-10 20:31:36,563 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-03-10 20:31:37,553 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1dc0e7a
2008-03-10 20:31:37,714 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-03-10 20:31:37,730 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
2008-03-10 20:31:37,730 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@f8f7db

I can leave the cluster running for hours and this slave will never 
"register" itself with the namenode. I've been messing with this problem 
for three days now and I'm out of ideas. Any suggestions?
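
For what it's worth, I'm judging "registered" by the datanode count on
the namenode web UI (port 50070, per the Jetty lines in the log above)
and by the dfsadmin report, along these lines:

    $ bin/hadoop dfsadmin -report

Both only ever show the master's own datanode.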

Regards,
Tim Nelson


RE: zombie data nodes, not alive but not dead

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
The following might be impacting you (from the release notes); see the
config sketch after the list:

http://issues.apache.org/jira/browse/HADOOP-2185

    HADOOP-2185.  RPC Server uses any available port if the specified
    port is zero. Otherwise it uses the specified port. Also combines
    the configuration attributes for the servers' bind address and
    port from "x.x.x.x" and "y" to "x.x.x.x:y".
    Deprecated configuration variables:
      dfs.info.bindAddress
      dfs.info.port
      dfs.datanode.bindAddress
      dfs.datanode.port
      dfs.datanode.info.bindAddress
      dfs.datanode.info.port
      dfs.secondary.info.bindAddress
      dfs.secondary.info.port
      mapred.job.tracker.info.bindAddress
      mapred.job.tracker.info.port
      mapred.task.tracker.report.bindAddress
      tasktracker.http.bindAddress
      tasktracker.http.port
    New configuration variables (post HADOOP-2404):
      dfs.secondary.http.address
      dfs.datanode.address
      dfs.datanode.http.address
      dfs.http.address
      mapred.job.tracker.http.address
      mapred.task.tracker.report.address
      mapred.task.tracker.http.address
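
If you were setting any of the deprecated variables, the new-style
equivalents go into hadoop-site.xml along these lines (a sketch; the
0.0.0.0 addresses and ports shown are just the stock bind-to-all
defaults, matching the 50010/50075/50070 ports in your logs):

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50075</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
</property>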


Re: zombie data nodes, not alive but not dead

Posted by Dave Coyle <ho...@davecoyle.com>.
On 2008-03-10 23:37:36 -0400, hadoop@enigmasecurity.com wrote:
> I can leave the cluster running for hours and this slave will never 
> "register" itself with the namenode. I've been messing with this problem 
> for three days now and I'm out of ideas. Any suggestions?

I had a similar-sounding problem with a 0.16.0 setup: the namenode
thinks the datanodes are dead, but the datanodes complain if the
namenode is unreachable, so there must be *some* connectivity.
Admittedly I haven't had time yet to recreate what I did and check
whether I had just mangled some config somewhere, but I was eventually
able to sort out my problem by... and yes, this sounds a bit wacky...
running a given datanode interactively, suspending it, then bringing
it back to the foreground.  E.g. (assuming your namenode is already
running):

    $ bin/hadoop datanode
    <ctrl-Z>
    $ fg

and the datanode then magically registered with the namenode.

Give it a shot... I'm curious to hear if it works for you, too.

-Coyle