Posted to hdfs-user@hadoop.apache.org by Ben Kim <be...@gmail.com> on 2013/01/21 03:52:10 UTC

Decommissioning a datanode takes forever

Hi!

I followed the decommissioning guide on the hadoop hdfs wiki.
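
For reference, these are roughly the steps I took (a quick sketch; the exclude
file path and hostname below are just examples from my setup, not anything
standard):

  # hdfs-site.xml on the namenode points at an exclude file via
  # dfs.hosts.exclude, e.g. /etc/hadoop/conf/dfs.exclude in my case
  echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
  # tell the namenode to re-read its include/exclude lists
  hadoop dfsadmin -refreshNodes
  # the node then shows as decommissioning in the web UI and its blocks
  # start getting re-replicated to the remaining datanodes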

The HDFS web UI shows that the decommissioning process has successfully
begun.

It started re-replicating 80,000 blocks across the cluster, but for some
reason it stopped at 9,059 blocks. I've waited 30 hours and there is still
no progress.

Anyone have any idea?
-- 

*Benjamin Kim*
*benkimkimben at gmail*

Re: Decommissioning a datanode takes forever

Posted by Ben Kim <be...@gmail.com>.
Impatient as I am, I just shut down the cluster and restarted it with an
empty exclude file.

If I add the datanode hostname back to the exclude file and run hadoop
dfsadmin -refreshNodes, *the datanode goes straight to the dead nodes list*
without going through the decommissioning process.
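
In case it helps, this is how I've been watching the node state (just the
standard commands, nothing fancy):

  # per-datanode report; the node being retired should show a line like
  # "Decommission Status : Decommission in progress"
  hadoop dfsadmin -report
  # the namenode web UI (dfshealth.jsp, port 50070 in my setup) also lists
  # live, dead and decommissioning nodes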

I'm done for today. Maybe someone else can figure it out when I come back
tomorrow :)

Best regards,
Ben

On Tue, Jan 22, 2013 at 5:38 PM, Ben Kim <be...@gmail.com> wrote:

> UPDATE:
>
> The edit log WARN had nothing to do with the current problem.
>
> However, the replica placement warnings look suspicious.
> Please have a look at the following log lines.
>
>
> 2013-01-22 09:12:10,885 WARN
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
> enough replicas, still in need of 1
> 2013-01-22 00:02:17,541 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Block: blk_4844131893883391179_3440513,
> Expected Replicas: 10, live replicas: 9, corrupt replicas: 0,
> decommissioned replicas: 1, excess replicas: 0, Is Open File: false,
> Datanodes having this block: 203.235.211.155:50010 203.235.211.156:50010
> 203.235.211.145:50010 203.235.211.144:50010 203.235.211.146:50010
> 203.235.211.158:50010 203.235.211.159:50010 203.235.211.157:50010
> 203.235.211.160:50010 203.235.211.143:50010 ,
> Current Datanode: 203.235.211.155:50010, Is current datanode
> decommissioning: true
>
> I have set my replication factor to 3. I don't understand why Hadoop is
> trying to replicate it to 10 nodes. I have decommissioned one node, so I
> currently have 9 nodes in operation; it can never be replicated to 10
> nodes.
>
> I also see that all of the repeated warnings like the one above are for
> blk_4844131893883391179_3440513.
>
> How would I delete the block? It's not showing up as a corrupt block in
> fsck. :(
>
> BEN
>
>
>
>
>
> On Tue, Jan 22, 2013 at 9:28 AM, Ben Kim <be...@gmail.com> wrote:
>
>> Hi Varun, thank you for the response.
>>
>> No, there don't seem to be any corrupted blocks in my cluster.
>> I ran "hadoop fsck -blocks /" and it didn't report any corrupted blocks.
>>
>> However, these two WARNings have been repeating constantly in the namenode
>> log since the decommission started:
>>
>>    - 2013-01-22 09:16:30,908 WARN
>>    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Cannot roll edit log,
>>    edits.new files already exists in all healthy directories:
>>    - 2013-01-22 09:12:10,885 WARN
>>    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
>>    enough replicas, still in need of 1
>>
>> There isn't any WARN or ERROR in the decommissioning datanode's log.
>>
>> Ben
>>
>>
>>
>> On Mon, Jan 21, 2013 at 3:05 PM, varun kumar <va...@gmail.com> wrote:
>>
>>> Hi Ben,
>>>
>>> Are there any corrupted blocks in your Hadoop cluster?
>>>
>>> Regards,
>>> Varun Kumar
>>>
>>>
>>> On Mon, Jan 21, 2013 at 8:22 AM, Ben Kim <be...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I followed the decommissioning guide on the hadoop hdfs wiki.
>>>>
>>>> The HDFS web UI shows that the decommissioning process has
>>>> successfully begun.
>>>>
>>>> It started re-replicating 80,000 blocks across the cluster, but
>>>> for some reason it stopped at 9,059 blocks. I've waited 30 hours and
>>>> there is still no progress.
>>>>
>>>> Anyone have any idea?
>>>>  --
>>>>
>>>> *Benjamin Kim*
>>>> *benkimkimben at gmail*
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Varun Kumar.P
>>>
>>
>>
>>
>> --
>>
>> *Benjamin Kim*
>> *benkimkimben at gmail*
>>
>
>
>
> --
>
> *Benjamin Kim*
> *benkimkimben at gmail*
>



-- 

*Benjamin Kim*
*benkimkimben at gmail*

Re: Decommissioning a datanode takes forever

Posted by Ben Kim <be...@gmail.com>.
UPDATE:

The edit log WARN had nothing to do with the current problem.

However, the replica placement warnings look suspicious.
Please have a look at the following log lines.

2013-01-22 09:12:10,885 WARN
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
enough replicas, still in need of 1
2013-01-22 00:02:17,541 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Block: blk_4844131893883391179_3440513,
Expected Replicas: 10, live replicas: 9, corrupt replicas: 0,
decommissioned replicas: 1, excess replicas: 0, Is Open File: false,
Datanodes having this block: 203.235.211.155:50010 203.235.211.156:50010
203.235.211.145:50010 203.235.211.144:50010 203.235.211.146:50010
203.235.211.158:50010 203.235.211.159:50010 203.235.211.157:50010
203.235.211.160:50010 203.235.211.143:50010 ,
Current Datanode: 203.235.211.155:50010, Is current datanode
decommissioning: true

I have set my replication factor to 3. I don't understand why Hadoop is
trying to replicate it to 10 nodes. I have decommissioned one node, so I
currently have 9 nodes in operation; it can never be replicated to 10
nodes.

I also see that all of the repeated warnings like the one above are for
blk_4844131893883391179_3440513.

How would I delete the block? It's not showing up as a corrupt block in
fsck. :(
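
The only idea I have left is to find which file the block belongs to and
force its replication back down to 3. A sketch of what I'd try (the output
file and target path below are made up):

  # dump the block map; the owning file's path appears above its block list
  hadoop fsck / -files -blocks -locations > /tmp/fsck-blocks.out
  grep -n blk_4844131893883391179 /tmp/fsck-blocks.out
  # check the file's current replication factor (second column of ls)
  hadoop fs -ls /path/to/that/file
  # then lower it back to 3 and wait for the change to complete
  hadoop fs -setrep -w 3 /path/to/that/file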

BEN




On Tue, Jan 22, 2013 at 9:28 AM, Ben Kim <be...@gmail.com> wrote:

> Hi Varun, thank you for the response.
>
> No, there don't seem to be any corrupted blocks in my cluster.
> I ran "hadoop fsck -blocks /" and it didn't report any corrupted blocks.
>
> However, these two WARNings have been repeating constantly in the namenode
> log since the decommission started:
>
>    - 2013-01-22 09:16:30,908 WARN
>    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Cannot roll edit log,
>    edits.new files already exists in all healthy directories:
>    - 2013-01-22 09:12:10,885 WARN
>    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
>    enough replicas, still in need of 1
>
> There isn't any WARN or ERROR in the decommissioning datanode's log.
>
> Ben
>
>
>
> On Mon, Jan 21, 2013 at 3:05 PM, varun kumar <va...@gmail.com> wrote:
>
>> Hi Ben,
>>
>> Are there any corrupted blocks in your Hadoop cluster?
>>
>> Regards,
>> Varun Kumar
>>
>>
>> On Mon, Jan 21, 2013 at 8:22 AM, Ben Kim <be...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> I followed the decommissioning guide on the hadoop hdfs wiki.
>>>
>>> The HDFS web UI shows that the decommissioning process has successfully
>>> begun.
>>>
>>> It started re-replicating 80,000 blocks across the cluster, but for
>>> some reason it stopped at 9,059 blocks. I've waited 30 hours and there is
>>> still no progress.
>>>
>>> Anyone have any idea?
>>>  --
>>>
>>> *Benjamin Kim*
>>> *benkimkimben at gmail*
>>>
>>
>>
>>
>> --
>> Regards,
>> Varun Kumar.P
>>
>
>
>
> --
>
> *Benjamin Kim*
> *benkimkimben at gmail*
>



-- 

*Benjamin Kim*
*benkimkimben at gmail*

Re: Decommissioning a datanode takes forever

Posted by Ben Kim <be...@gmail.com>.
Hi Varun, thank you for the response.

No, there don't seem to be any corrupted blocks in my cluster.
I ran "hadoop fsck -blocks /" and it didn't report any corrupted blocks.

However, these two WARNings have been repeating constantly in the namenode
log since the decommission started:

   - 2013-01-22 09:16:30,908 WARN
   org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Cannot roll edit log,
   edits.new files already exists in all healthy directories:
   - 2013-01-22 09:12:10,885 WARN
   org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
   enough replicas, still in need of 1

There isn't any WARN or ERROR in the decommissioning datanode's log.

Ben


On Mon, Jan 21, 2013 at 3:05 PM, varun kumar <va...@gmail.com> wrote:

> Hi Ben,
>
> Are there any corrupted blocks in your Hadoop cluster?
>
> Regards,
> Varun Kumar
>
>
> On Mon, Jan 21, 2013 at 8:22 AM, Ben Kim <be...@gmail.com> wrote:
>
>> Hi!
>>
>> I followed the decommissioning guide on the hadoop hdfs wiki.
>>
>> The HDFS web UI shows that the decommissioning process has successfully
>> begun.
>>
>> It started re-replicating 80,000 blocks across the cluster, but for
>> some reason it stopped at 9,059 blocks. I've waited 30 hours and there is
>> still no progress.
>>
>> Anyone have any idea?
>>  --
>>
>> *Benjamin Kim*
>> *benkimkimben at gmail*
>>
>
>
>
> --
> Regards,
> Varun Kumar.P
>



-- 

*Benjamin Kim*
*benkimkimben at gmail*

Re: Decommissioning a datanode takes forever

Posted by varun kumar <va...@gmail.com>.
Hi Ben,

Are there any corrupted blocks in your Hadoop cluster?

Regards,
Varun Kumar

On Mon, Jan 21, 2013 at 8:22 AM, Ben Kim <be...@gmail.com> wrote:

> Hi!
>
> I followed the decommissioning guide on the hadoop hdfs wiki.
>
> The HDFS web UI shows that the decommissioning process has successfully
> begun.
>
> It started re-replicating 80,000 blocks across the cluster, but for
> some reason it stopped at 9,059 blocks. I've waited 30 hours and there is
> still no progress.
>
> Anyone have any idea?
> --
>
> *Benjamin Kim*
> *benkimkimben at gmail*
>



-- 
Regards,
Varun Kumar.P