Posted to common-user@hadoop.apache.org by Scott White <sc...@gmail.com> on 2010/05/18 06:32:13 UTC

Data node decommission doesn't seem to be working correctly

I followed the steps mentioned here:
http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
decommission a data node. What I see from the namenode is that the hostname of
the machine I decommissioned shows up both in the list of dead nodes and in
the list of live nodes, where its admin status is marked as 'In Service'. It's
been twelve hours and there is no sign in the namenode logs that the node
has been decommissioned. Any suggestions of what might be the problem and
what to try to ensure that this node gets safely taken down?

thanks in advance,
Scott
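For context, the tutorial's procedure boils down to roughly the following. This is only a sketch: the exclude-file path and hostname are made up, and hdfs-site.xml must already point dfs.hosts.exclude at the exclude file before the namenode starts.

```shell
# Sketch of the 0.20-era decommission steps; path and hostname are hypothetical.
# hdfs-site.xml is assumed to already contain:
#   <property>
#     <name>dfs.hosts.exclude</name>
#     <value>/etc/hadoop/excludes</value>
#   </property>

EXCLUDES=excludes                                  # stand-in for /etc/hadoop/excludes
printf '%s\n' "datanode1.example.com" > "$EXCLUDES"  # one node to retire per line

# Then, on the namenode (needs a live cluster, so commented out here):
# hadoop dfsadmin -refreshNodes

cat "$EXCLUDES"   # -> datanode1.example.com
```

The crucial detail, as this thread goes on to discuss, is that the entry in the exclude file has to match the name or IP the datanode actually registered with.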

Re: Data node decommission doesn't seem to be working correctly

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Scott,

If the node shows up in both the dead nodes and the live nodes as you say, the namenode is definitely not even attempting to decommission it.  If HDFS were attempting decommissioning and you restarted the namenode, it would show up only in the dead nodes list.

Another option is to just turn off HDFS on that node alone, and not physically delete the data from the node until HDFS completely recovers.  This is not recommended for "production usage", as it creates a period where the cluster is in danger of losing files.  However, it can be used as a one-off to get over this speed bump.

Brian
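That one-off boils down to roughly these cluster-side commands. They are shown commented since they need a live cluster, and the daemon script name matches a 0.20-era tarball layout, which may differ in your install:

```shell
# On the node being retired: stop only the datanode process, leave its disks alone.
# bin/hadoop-daemon.sh stop datanode

# On the namenode: repeat until no blocks remain under-replicated, and only
# then wipe the retired node's disks.
# hadoop fsck / | grep 'Under-replicated blocks'
```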

On May 18, 2010, at 12:02 PM, Scott White wrote:

> dfsadmin -report shows the hostname for that machine and not the IP. That
> machine happens to be the master node, which is why I am trying to
> decommission the data node there, since I only want the data node running on
> the slave nodes. dfsadmin -report shows all the IPs for the slave nodes.
> 
> One question: I believe that the namenode was accidentally restarted during
> the 12 hours or so I was waiting for the decommission to complete. Would
> this put things into a bad state? I did try running dfsadmin -refreshNodes
> after it was restarted.
> 
> Scott
> 
> 
> On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
> 
>> Hey Scott,
>> 
>> Hadoop tends to get confused by nodes with multiple hostnames or multiple
>> IP addresses.  Is this your case?
>> 
>> I can't remember precisely what our admin does, but I think he puts the
>> IP address Hadoop listens on into the exclude-hosts file.
>> 
>> Look in the output of
>> 
>> hadoop dfsadmin -report
>> 
>> to determine precisely which IP address your datanode is listening on.
>> 
>> Brian
>> 
>> On May 17, 2010, at 11:32 PM, Scott White wrote:
>> 
>>> I followed the steps mentioned here:
>>> http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
>>> decommission a data node. What I see from the namenode is that the hostname of
>>> the machine I decommissioned shows up both in the list of dead nodes and in
>>> the list of live nodes, where its admin status is marked as 'In Service'. It's
>>> been twelve hours and there is no sign in the namenode logs that the node
>>> has been decommissioned. Any suggestions of what might be the problem and
>>> what to try to ensure that this node gets safely taken down?
>>> 
>>> thanks in advance,
>>> Scott
>> 
>> 


Re: Data node decommission doesn't seem to be working correctly

Posted by Scott White <sc...@gmail.com>.
dfsadmin -report shows the hostname for that machine and not the IP. That
machine happens to be the master node, which is why I am trying to
decommission the data node there, since I only want the data node running on
the slave nodes. dfsadmin -report shows all the IPs for the slave nodes.

One question: I believe that the namenode was accidentally restarted during
the 12 hours or so I was waiting for the decommission to complete. Would
this put things into a bad state? I did try running dfsadmin -refreshNodes
after it was restarted.

Scott


On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hey Scott,
>
> Hadoop tends to get confused by nodes with multiple hostnames or multiple
> IP addresses.  Is this your case?
>
> I can't remember precisely what our admin does, but I think he puts the
> IP address Hadoop listens on into the exclude-hosts file.
>
> Look in the output of
>
> hadoop dfsadmin -report
>
> to determine precisely which IP address your datanode is listening on.
>
> Brian
>
> On May 17, 2010, at 11:32 PM, Scott White wrote:
>
> > I followed the steps mentioned here:
> > http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
> > decommission a data node. What I see from the namenode is that the hostname of
> > the machine I decommissioned shows up both in the list of dead nodes and in
> > the list of live nodes, where its admin status is marked as 'In Service'. It's
> > been twelve hours and there is no sign in the namenode logs that the node
> > has been decommissioned. Any suggestions of what might be the problem and
> > what to try to ensure that this node gets safely taken down?
> >
> > thanks in advance,
> > Scott
>
>

Re: Data node decommission doesn't seem to be working correctly

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Scott,

Hadoop tends to get confused by nodes with multiple hostnames or multiple IP addresses.  Is this your case?

I can't remember precisely what our admin does, but I think he puts the IP address Hadoop listens on into the exclude-hosts file.

Look in the output of 

hadoop dfsadmin -report

to determine precisely which IP address your datanode is listening on.

Brian
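The address to match is the one on the per-node "Name:" lines of that report. A minimal way to pull them out (the report excerpt below is made up; the real text comes from running `hadoop dfsadmin -report` on the namenode):

```shell
# Hypothetical dfsadmin -report excerpt; real output comes from the namenode.
report='Name: 192.168.1.10:50010
Decommission Status : Normal
Configured Capacity: 1000000000 (953.67 MB)
Name: 192.168.1.11:50010
Decommission Status : Normal'

# The address after "Name:" is what each datanode registered with; that exact
# IP (or hostname) is what belongs in the exclude file.
printf '%s\n' "$report" | awk '/^Name:/ {print $2}'
# -> 192.168.1.10:50010
#    192.168.1.11:50010
```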

On May 17, 2010, at 11:32 PM, Scott White wrote:

> I followed the steps mentioned here:
> http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
> decommission a data node. What I see from the namenode is that the hostname of
> the machine I decommissioned shows up both in the list of dead nodes and in
> the list of live nodes, where its admin status is marked as 'In Service'. It's
> been twelve hours and there is no sign in the namenode logs that the node
> has been decommissioned. Any suggestions of what might be the problem and
> what to try to ensure that this node gets safely taken down?
> 
> thanks in advance,
> Scott


Re: Data node decommission doesn't seem to be working correctly

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Hi Scott, 

You might be hitting two different issues.

1) Decommission not finishing.
   https://issues.apache.org/jira/browse/HDFS-694  explains decommission
never finishing due to open files in 0.20

2) Nodes showing up in both the live and dead node lists.
   I remember Suresh taking a look at this.
   It was something about the same node being registered under its hostname
and under its IP separately (when a datanode is re-imaged and started fresh (?)).

Cc-ing Suresh.

Koji
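For issue (1), one way to check whether open files are in play is fsck's -openforwrite listing. A sketch over a made-up excerpt (the real text comes from running `hadoop fsck / -openforwrite` on the namenode, and the paths and layout here are hypothetical):

```shell
# Hypothetical fsck -openforwrite excerpt; real output comes from the namenode.
fsck_out='/user/scott/logs/app.log 1048576 bytes, 1 block(s), OPENFORWRITE:
/user/scott/data/part-00000 134217728 bytes, 2 block(s):  OK'

# Files still marked OPENFORWRITE can keep a 0.20 decommission from
# finishing (HDFS-694); count them.
printf '%s\n' "$fsck_out" | grep -c 'OPENFORWRITE'
# -> 1
```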

On 5/17/10 9:32 PM, "Scott White" <sc...@gmail.com> wrote:

> I followed the steps mentioned here:
> http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
> decommission a data node. What I see from the namenode is that the hostname of
> the machine I decommissioned shows up both in the list of dead nodes and in
> the list of live nodes, where its admin status is marked as 'In Service'. It's
> been twelve hours and there is no sign in the namenode logs that the node
> has been decommissioned. Any suggestions of what might be the problem and
> what to try to ensure that this node gets safely taken down?
> 
> thanks in advance,
> Scott