Posted to common-user@hadoop.apache.org by Tim Nelson <ha...@enigmasecurity.com> on 2008/03/11 04:37:36 UTC
zombie data nodes, not alive but not dead
I've got to be doing something stupid, because I can't find any mention of
others having this problem. Here's what's happening. I had a cluster of
nine nodes (1 namenode and 8 datanodes) running the 0.15.3 release. I've
been running the mapred samples and reformatting filesystems, just
getting comfortable with the software. When I upgraded to the 0.16.0 release
I reformatted (mke2fs) all of my data partitions (including the namenode
data) and ran "hadoop namenode -format", which completed fine. When I brought
the cluster back up, the only slave to connect to the master was the master
itself acting as a datanode. The dfs daemon starts on the slave nodes,
but it never seems to connect to the master.
I know that the slaves are doing *something* with the master, because if
I start them before the namenode is running I get lots of log messages
about attempting to reconnect. Below are my site config and the logs
from the namenode and from a "zombie" datanode.
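For reference, the upgrade sequence I ran was roughly the following (the
device name is just an example from my layout, and I used the stock
start/stop scripts):

$ bin/stop-all.sh               # stop any daemons left over from 0.15.3
$ mke2fs /dev/sda1              # reformat the data partition (done on every node)
$ mount /dev/sda1 /mnt/sda1
$ bin/hadoop namenode -format   # on the master, recreate the HDFS namespace
$ bin/start-dfs.sh              # bring the namenode and datanodes back up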
**** hadoop-site.xml (same across all nodes) ****
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/sda1/hadoop-datastore-0.15.3/hadoop-${user.name}</value>
    <description>...</description>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://head00:54310</value>
    <description>...</description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>head00:54311</value>
    <description>...</description>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>...</description>
  </property>

</configuration>
**** namenode log ****
2008-03-10 19:32:53,186 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = head00/192.168.16.48
STARTUP_MSG: args = []
************************************************************/
2008-03-10 19:32:54,260 INFO org.apache.hadoop.dfs.NameNode: Namenode up
at: head00/192.168.16.48:54310
2008-03-10 19:32:54,267 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-03-10 19:32:54,628 INFO org.apache.hadoop.dfs.StateChange: STATE*
Network topology has 0 racks and 0 datanodes
2008-03-10 19:32:54,629 INFO org.apache.hadoop.dfs.StateChange: STATE*
UnderReplicatedBlocks has 0 blocks
2008-03-10 19:32:55,421 INFO org.mortbay.util.Credential: Checking
Resource aliases
2008-03-10 19:32:56,048 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-03-10 19:32:56,051 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-03-10 19:32:56,052 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-03-10 19:32:57,493 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.WebApplicationHandler@30d082
2008-03-10 19:32:57,826 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-03-10 19:32:58,112 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50070
2008-03-10 19:32:58,112 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.Server@18e2b22
2008-03-10 19:32:58,112 INFO org.apache.hadoop.fs.FSNamesystem:
Web-server up at: 50070
2008-03-10 19:32:58,116 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 54310: starting
2008-03-10 19:32:58,139 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 54310: starting
2008-03-10 19:32:58,140 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54310: starting
...
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54310: starting
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 54310: starting
2008-03-10 19:33:02,684 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 192.168.16.48:50010
storage DS-437400207-192.168.16.48-50010-1205199182672
    ***** this is the namenode connecting to itself as a data node *****
2008-03-10 19:33:02,693 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.16.48:50010
2008-03-10 19:38:01,248 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
2008-03-10 19:38:01,249 INFO org.apache.hadoop.fs.FSNamesystem: Number
of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0
SyncTimes(ms): 0
2008-03-10 19:43:02,227 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
2008-03-10 19:48:02,374 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
**** datanode log ****
2008-03-10 20:30:31,392 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = node05/192.168.16.55
STARTUP_MSG: args = []
************************************************************/
2008-03-10 20:30:31,786 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-10 20:30:32,000 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 1 time(s).
2008-03-10 20:30:33,006 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 2 time(s).
...
2008-03-10 20:31:30,244 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 4 time(s).
2008-03-10 20:31:31,250 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 5 time(s).
2008-03-10 20:31:36,130 INFO org.apache.hadoop.dfs.Storage: Storage
directory /mnt/sda1/hadoop-datastore-0.15.3/hadoop-hadoop/dfs/data is
not formatted.
2008-03-10 20:31:36,131 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-03-10 20:31:36,196 INFO org.apache.hadoop.dfs.DataNode: Opened
server at 50010
2008-03-10 20:31:36,419 INFO org.mortbay.util.Credential: Checking
Resource aliases
2008-03-10 20:31:36,563 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-03-10 20:31:37,553 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.WebApplicationHandler@1dc0e7a
2008-03-10 20:31:37,714 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-03-10 20:31:37,730 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50075
2008-03-10 20:31:37,730 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.Server@f8f7db
I can leave the cluster running for hours and this slave will never
"register" itself with the namenode. I've been messing with this problem
for three days now and I'm out of ideas. Any suggestions?
Regards,
Tim Nelson
RE: zombie data nodes, not alive but not dead
Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
The following issues might be impacting you (from the release notes):
http://issues.apache.org/jira/browse/HADOOP-2185
HADOOP-2185. RPC Server uses any available port if the specified
port is zero. Otherwise it uses the specified port. Also combines
the configuration attributes for the servers' bind address and
port from "x.x.x.x" and "y" to "x.x.x.x:y".
Deprecated configuration variables:
dfs.info.bindAddress
dfs.info.port
dfs.datanode.bindAddress
dfs.datanode.port
dfs.datanode.info.bindAddress
dfs.datanode.info.port
dfs.secondary.info.bindAddress
dfs.secondary.info.port
mapred.job.tracker.info.bindAddress
mapred.job.tracker.info.port
mapred.task.tracker.report.bindAddress
tasktracker.http.bindAddress
tasktracker.http.port
New configuration variables (post HADOOP-2404):
dfs.secondary.http.address
dfs.datanode.address
dfs.datanode.http.address
dfs.http.address
mapred.job.tracker.http.address
mapred.task.tracker.report.address
mapred.task.tracker.http.address
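For example, in the new combined form a hadoop-site.xml entry looks like
this (an illustrative fragment using the stock default ports, not
something taken from your config):

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>Replaces dfs.datanode.bindAddress + dfs.datanode.port</description>
</property>

<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
  <description>Namenode web UI; replaces dfs.info.bindAddress + dfs.info.port</description>
</property>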
Re: zombie data nodes, not alive but not dead
Posted by Dave Coyle <ho...@davecoyle.com>.
On 2008-03-10 23:37:36 -0400, hadoop@enigmasecurity.com wrote:
> I can leave the cluster running for hours and this slave will never
> "register" itself with the namenode. I've been messing with this problem
> for three days now and I'm out of ideas. Any suggestions?
I had a similar-sounding problem with a 0.16.0 setup... the namenode
thinks the datanodes are dead, but the datanodes complain if the
namenode is unreachable, so there must be *some* connectivity.
Admittedly I haven't had time yet to recreate what I did to see if I
had just mangled some config somewhere, but I was eventually able to
sort out my problem (and yes, this sounds a bit wacky) by running a
given datanode interactively, suspending it, then bringing it back to
the foreground, e.g. (assuming your namenode is already running):
$ bin/hadoop datanode
<ctrl-Z>
$ fg
and the datanode then magically registered with the namenode.
Give it a shot... I'm curious to hear if it works for you, too.
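If you want a quick way to check whether a given datanode has actually
registered, the stock dfsadmin tool will tell you:

$ bin/hadoop dfsadmin -report   # lists the datanodes the namenode currently knows about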
-Coyle