Posted to common-user@hadoop.apache.org by Espen Amble Kolstad <es...@trank.no> on 2007/03/27 08:49:56 UTC

Decommission in hadoop-0.12.2

Hi,

I'm trying to decommission a node with hadoop-0.12.2.
I use the property dfs.hosts.exclude, since the command hadoop
dfsadmin -decommission seems to be gone.
I then start the cluster with an empty exclude-file, add the name of the node
to decommission and run hadoop dfsadmin -refreshNodes.
The log then says:
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node 
81.93.168.215:50010

But nothing happens.
I've left it in this state over night, but still nothing.

Am I missing something?

- Espen
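For readers following along, the procedure described in the message above can be sketched as shell commands. The paths below are illustrative, and the property snippet assumes dfs.hosts.exclude is set in the site configuration as described:

```shell
# Sketch of the decommission procedure described above (paths are
# illustrative). dfs.hosts.exclude must point at the exclude file,
# e.g. in conf/hadoop-site.xml:
#   <property>
#     <name>dfs.hosts.exclude</name>
#     <value>/path/to/conf/exclude</value>
#   </property>

# 1. Start the cluster with an empty exclude file in place.
touch /path/to/conf/exclude

# 2. Add the datanode to decommission, then tell the namenode to
#    re-read the file:
echo "81.93.168.215" >> /path/to/conf/exclude
hadoop dfsadmin -refreshNodes

# 3. Watch progress; the node should show "Decommission in progress":
hadoop dfsadmin -report
```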

Re: Decommission in hadoop-0.12.2

Posted by Espen Amble Kolstad <es...@trank.no>.
Hi,

I changed replication for the entire hdfs to 2, and then tried to 
decommission. That seemed to do the trick. The namenode-log immediately 
started printing:
2007-03-27 17:37:19,954 INFO  dfs.StateChange - BLOCK* 
NameSystem.pendingTransfer: ask x.x.x.x:50010 to replicate 
blk_9167696482646713604 to datanode(s) x.x.x.x:50010
2007-03-27 17:37:19,954 INFO  dfs.StateChange - BLOCK* 
NameSystem.pendingTransfer: ask x.x.x.x:50010 to replicate 
blk_9168899963250271798 to datanode(s) x.x.x.x:50010
and then finally:
2007-03-28 00:10:41,876 INFO  fs.FSNamesystem - Decommission complete for node 
x.x.x.x:50010

Could it be that decommission doesn't work when replication is set to 1?

Thanks for your help!

- Espen
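The replication change described above can be done with the dfs shell's setrep command; a sketch, assuming the syntax of the FsShell of that era:

```shell
# Raise the replication factor of everything under / to 2, recursively.
# (Sketch; run from the hadoop installation directory.)
bin/hadoop dfs -setrep -R 2 /

# Adding -w would additionally block until each file reaches the
# target replication:
# bin/hadoop dfs -setrep -R -w 2 /
```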

On Tuesday 27 March 2007 18:46:54 Dhruba Borthakur wrote:
> I agree. A decommission-meter would be a really helpful tool to monitor the
> progress of a decommission command.
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Tuesday, March 27, 2007 9:45 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Decommission in hadoop-0.12.2
>
> Dhruba Borthakur wrote:
> > The decommission-in-progress state indicates that the Namenode is
> > triggering replication of blocks that reside on the
> > node-being-decommissioned. When all those blocks get replicated to
> > other Datanodes, then the state should change to "decommissioned".
> >
> > You can run bin/hadoop fsck -blocks -locations -files to list out all
> > the locations of all blocks in the fs (this might take lots of time
> > depending on the number of files). Please verify if any of the blocks
> > that reside on the decommission-in-progress node have 2 replicas. Once
> > all those blocks have two replicas (because you have set replication
> > factor to 1), the decommissioning should be complete.
>
> ... though it would be nice if the report gave a "xx% complete"
> information ...



RE: Decommission in hadoop-0.12.2

Posted by Dhruba Borthakur <dh...@yahoo-inc.com>.
I agree. A decommission-meter would be a really helpful tool to monitor the
progress of a decommission command.

Thanks,
dhruba

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Tuesday, March 27, 2007 9:45 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Decommission in hadoop-0.12.2

Dhruba Borthakur wrote:
> The decommission-in-progress state indicates that the Namenode is
> triggering replication of blocks that reside on the
> node-being-decommissioned. When all those blocks get replicated to other
> Datanodes, then the state should change to "decommissioned".
>
> You can run bin/hadoop fsck -blocks -locations -files to list out all the
> locations of all blocks in the fs (this might take lots of time depending
> on the number of files). Please verify if any of the blocks that reside on
> the decommission-in-progress node have 2 replicas. Once all those blocks
> have two replicas (because you have set replication factor to 1), the
> decommissioning should be complete.

... though it would be nice if the report gave a "xx% complete" 
information ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Decommission in hadoop-0.12.2

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dhruba Borthakur wrote:
> The decommission-in-progress state indicates that the Namenode is triggering
> replication of blocks that reside on the node-being-decommissioned. When all
> those blocks get replicated to other Datanodes, then the state should
> change to "decommissioned".
>
> You can run bin/hadoop fsck -blocks -locations -files to list out all the
> locations of all blocks in the fs (this might take lots of time depending on
> the number of files). Please verify if any of the blocks that reside on the
> decommission-in-progress node have 2 replicas. Once all those blocks have
> two replicas (because you have set replication factor to 1), the
> decommissioning should be complete.

... though it would be nice if the report gave a "xx% complete" 
information ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Decommission in hadoop-0.12.2

Posted by Dhruba Borthakur <dh...@yahoo-inc.com>.
The decommission-in-progress state indicates that the Namenode is triggering
replication of blocks that reside on the node-being-decommissioned. When all
those blocks get replicated to other Datanodes, then the state should
change to "decommissioned".

You can run bin/hadoop fsck -blocks -locations -files to list out all the
locations of all blocks in the fs (this might take lots of time depending on
the number of files). Please verify if any of the blocks that reside on the
decommission-in-progress node have 2 replicas. Once all those blocks have
two replicas (because you have set replication factor to 1), the
decommissioning should be complete.

Thanks,
dhruba
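One way to act on the advice above is to save the fsck output to a file and grep it for blocks that still have only one replica on the node being decommissioned. The block-line format shown here is an assumption about fsck output of that era and may differ between versions:

```shell
# Hypothetical sample of "hadoop fsck / -files -blocks -locations" output,
# saved to a file (the real format may differ between Hadoop versions):
cat > /tmp/fsck.out <<'EOF'
0. blk_9167696482646713604 len=67108864 repl=1 [81.93.168.215:50010]
1. blk_9168899963250271798 len=67108864 repl=2 [81.93.168.215:50010, 10.0.0.2:50010]
2. blk_1234567890123456789 len=67108864 repl=1 [10.0.0.3:50010]
EOF

NODE="81.93.168.215:50010"
# Count blocks on $NODE that still have only one replica; decommission
# must re-replicate exactly these before it can complete.
grep "repl=1 " /tmp/fsck.out | grep -c "$NODE"
# prints 1 for this sample: only the first block still needs another replica
```

When the count reaches zero, every block on the node has a copy elsewhere and the decommission should finish.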


-----Original Message-----
From: Espen Amble Kolstad [mailto:espen@trank.no] 
Sent: Tuesday, March 27, 2007 1:23 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Decommission in hadoop-0.12.2

On Tuesday 27 March 2007 10:03:41 Andrzej Bialecki wrote:
> Espen Amble Kolstad wrote:
> > On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
> >> Espen Amble Kolstad wrote:
> >>> Hi,
> >>>
> >>> I'm trying to decommission a node with hadoop-0.12.2.
> >>> I use the property dfs.hosts.exclude, since the command hadoop
> >>> dfsadmin -decommission seems to be gone.
> >>> I then start the cluster with an empty exclude-file, add the name of
> >>> the node to decommission and run hadoop dfsadmin -refreshNodes.
> >>> The log then says:
> >>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> >>> node 81.93.168.215:50010
> >>>
> >>> But nothing happens.
> >>> I've left it in this state over night, but still nothing.
> >>>
> >>> Am I missing something ?
> >>
> >> What does the dfsadmin -report says about this node? It takes time to
> >> ensure that all blocks are replicated from this node to other nodes.
> >
> > Hi,
> >
> > dfsadmin -report:
> >
> > Name: 81.93.168.215:50010
> > State          : Decommission in progress
> > Total raw bytes: 1438871724032 (1.30 TB)
> > Used raw bytes: 270070137404 (0.24 TB)
> > % used: 18.76%
> > Last contact: Tue Mar 27 09:42:26 CEST 2007
> >
> > In the web-interface (dfshealth.jsp) no change can be seen in % or the
> > number of blocks on any of the nodes.
>
> You may want to check the datanode logs if there are any exceptions
> reported. Also, things are taking time - I believe the datanodes
> synchronize their block information piecewise, so that they don't
> overwhelm the namenode. It surely takes some time in my case, even
> though the disk size per node that I use is much smaller.
>
> Regarding the number of blocks - if all blocks are already present on
> other datanodes at least in 1 copy, then no new blocks need to be
> created - I'm not sure when the namenode decides that these blocks
> should get additional replicas: during the decommissioning or after it's
> complete ...
>
> It would be nice to have a progress meter on the decommissioning
> process, though.

Hi,

I have replication set to 1 for the whole hdfs, so there should not be any 
other replicas.
I can't find any errors in my logs. And the namenode-log looks like this (at
INFO level):
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node 
81.93.168.215:50010
2007-03-27 09:04:48,831 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 09:04:49,500 INFO  fs.FSNamesystem - Roll FSImage
2007-03-27 10:04:50,221 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 10:04:50,360 INFO  fs.FSNamesystem - Roll FSImage

- Espen


Re: Decommission in hadoop-0.12.2

Posted by Espen Amble Kolstad <es...@trank.no>.
On Tuesday 27 March 2007 10:03:41 Andrzej Bialecki wrote:
> Espen Amble Kolstad wrote:
> > On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
> >> Espen Amble Kolstad wrote:
> >>> Hi,
> >>>
> >>> I'm trying to decommission a node with hadoop-0.12.2.
> >>> I use the property dfs.hosts.exclude, since the command hadoop
> >>> dfsadmin -decommission seems to be gone.
> >>> I then start the cluster with an empty exclude-file, add the name of
> >>> the node to decommission and run hadoop dfsadmin -refreshNodes.
> >>> The log then says:
> >>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> >>> node 81.93.168.215:50010
> >>>
> >>> But nothing happens.
> >>> I've left it in this state over night, but still nothing.
> >>>
> >>> Am I missing something ?
> >>
> >> What does dfsadmin -report say about this node? It takes time to
> >> ensure that all blocks are replicated from this node to other nodes.
> >
> > Hi,
> >
> > dfsadmin -report:
> >
> > Name: 81.93.168.215:50010
> > State          : Decommission in progress
> > Total raw bytes: 1438871724032 (1.30 TB)
> > Used raw bytes: 270070137404 (0.24 TB)
> > % used: 18.76%
> > Last contact: Tue Mar 27 09:42:26 CEST 2007
> >
> > In the web-interface (dfshealth.jsp) no change can be seen in % or the
> > number of blocks on any of the nodes.
>
> You may want to check the datanode logs if there are any exceptions
> reported. Also, things are taking time - I believe the datanodes
> synchronize their block information piecewise, so that they don't
> overwhelm the namenode. It surely takes some time in my case, even
> though the disk size per node that I use is much smaller.
>
> Regarding the number of blocks - if all blocks are already present on
> other datanodes at least in 1 copy, then no new blocks need to be
> created - I'm not sure when the namenode decides that these blocks
> should get additional replicas: during the decommissioning or after it's
> complete ...
>
> It would be nice to have a progress meter on the decommissioning
> process, though.

Hi,

I have replication set to 1 for the whole hdfs, so there should not be any 
other replicas.
I can't find any errors in my logs. And the namenode-log looks like this (at 
INFO level):
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node 
81.93.168.215:50010
2007-03-27 09:04:48,831 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 09:04:49,500 INFO  fs.FSNamesystem - Roll FSImage
2007-03-27 10:04:50,221 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 10:04:50,360 INFO  fs.FSNamesystem - Roll FSImage

- Espen

Re: Decommission in hadoop-0.12.2

Posted by Andrzej Bialecki <ab...@getopt.org>.
Espen Amble Kolstad wrote:
> On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
>> Espen Amble Kolstad wrote:
>>> Hi,
>>>
>>> I'm trying to decommission a node with hadoop-0.12.2.
>>> I use the property dfs.hosts.exclude, since the command hadoop
>>> dfsadmin -decommission seems to be gone.
>>> I then start the cluster with an empty exclude-file, add the name of the
>>> node to decommission and run hadoop dfsadmin -refreshNodes.
>>> The log then says:
>>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
>>> node 81.93.168.215:50010
>>>
>>> But nothing happens.
>>> I've left it in this state over night, but still nothing.
>>>
>>> Am I missing something ?
>> What does dfsadmin -report say about this node? It takes time to
>> ensure that all blocks are replicated from this node to other nodes.
> 
> Hi,
> 
> dfsadmin -report:
> 
> Name: 81.93.168.215:50010
> State          : Decommission in progress
> Total raw bytes: 1438871724032 (1.30 TB)
> Used raw bytes: 270070137404 (0.24 TB)
> % used: 18.76%
> Last contact: Tue Mar 27 09:42:26 CEST 2007
> 
> In the web-interface (dfshealth.jsp) no change can be seen in % or the number 
> of blocks on any of the nodes.

You may want to check the datanode logs to see if any exceptions are
reported. Also, things are taking time - I believe the datanodes
synchronize their block information piecewise, so that they don't 
overwhelm the namenode. It surely takes some time in my case, even 
though the disk size per node that I use is much smaller.

Regarding the number of blocks - if all blocks are already present on 
other datanodes at least in 1 copy, then no new blocks need to be 
created - I'm not sure when the namenode decides that these blocks 
should get additional replicas: during the decommissioning or after it's 
complete ...

It would be nice to have a progress meter on the decommissioning 
process, though.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Decommission in hadoop-0.12.2

Posted by Espen Amble Kolstad <es...@trank.no>.
On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
> Espen Amble Kolstad wrote:
> > Hi,
> >
> > I'm trying to decommission a node with hadoop-0.12.2.
> > I use the property dfs.hosts.exclude, since the command hadoop
> > dfsadmin -decommission seems to be gone.
> > I then start the cluster with an empty exclude-file, add the name of the
> > node to decommission and run hadoop dfsadmin -refreshNodes.
> > The log then says:
> > 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> > node 81.93.168.215:50010
> >
> > But nothing happens.
> > I've left it in this state over night, but still nothing.
> >
> > Am I missing something ?
>
> What does dfsadmin -report say about this node? It takes time to
> ensure that all blocks are replicated from this node to other nodes.

Hi,

dfsadmin -report:

Name: 81.93.168.215:50010
State          : Decommission in progress
Total raw bytes: 1438871724032 (1.30 TB)
Used raw bytes: 270070137404 (0.24 TB)
% used: 18.76%
Last contact: Tue Mar 27 09:42:26 CEST 2007

In the web-interface (dfshealth.jsp) no change can be seen in % or the number 
of blocks on any of the nodes.

- Espen

Re: Decommission in hadoop-0.12.2

Posted by Andrzej Bialecki <ab...@getopt.org>.
Espen Amble Kolstad wrote:
> Hi,
> 
> I'm trying to decommission a node with hadoop-0.12.2.
> I use the property dfs.hosts.exclude, since the command hadoop
> dfsadmin -decommission seems to be gone.
> I then start the cluster with an empty exclude-file, add the name of the node
> to decommission and run hadoop dfsadmin -refreshNodes.
> The log then says:
> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node 
> 81.93.168.215:50010
> 
> But nothing happens.
> I've left it in this state over night, but still nothing.
> 
> Am I missing something ?

What does dfsadmin -report say about this node? It takes time to
ensure that all blocks are replicated from this node to other nodes.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com