
Posted to hdfs-user@hadoop.apache.org by Stack <st...@duboce.net> on 2013/01/31 07:34:57 UTC

How to remove three disks from three different nodes in a ten node cluster in less than an hour without losing replicas?

Here is a little puzzle.

An admin works for a cash-strapped, popular web shop.  At the datacenter
she has a ten-node cluster that is heavily used.  It runs hot all day long,
and decommissioning a node, with its background re-replication of 12 disks'
worth of data, disrupts the workload she has on top of it and makes her
clients very unhappy.  Replicating the data of one node takes at least an
hour.  This cluster has three bad disks in three different nodes
(replication factor is 3).  The admin lives an hour from the datacenter.
She can't afford a cage monkey and so must replace the disks herself.

If she left home at 2pm and had to be back by 6pm, before the kids came home
from school, how would she replace the three disks without definitely losing
a replica?

Is the only answer to remove one, wait for a clean fsck run, then remove the
next one?
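
For illustration, the "wait on a clean fsck run" step could be scripted
roughly as below. This is only a sketch under assumptions: the summary labels
("Under-replicated blocks", "Missing replicas", "Corrupt blocks", "is HEALTHY")
are taken from typical fsck output and should be checked against your Hadoop
version, the `hdfs fsck` command name is assumed (older releases use
`hadoop fsck`), and the function names are hypothetical.

```python
import re
import subprocess
import time
from typing import Callable

# Assumption: these labels match the summary lines your fsck version prints.
LABELS = ("Under-replicated blocks", "Missing replicas", "Corrupt blocks")


def fsck_is_clean(report: str) -> bool:
    """Return True only if the report says HEALTHY and every counter is 0."""
    if "is HEALTHY" not in report:
        return False
    for label in LABELS:
        match = re.search(rf"{re.escape(label)}:\s*(\d+)", report)
        # A missing label is treated as "not clean" to stay conservative.
        if match is None or int(match.group(1)) != 0:
            return False
    return True


def run_fsck(path: str = "/") -> str:
    """Run fsck and return its text output (assumes `hdfs` is on PATH)."""
    result = subprocess.run(
        ["hdfs", "fsck", path], capture_output=True, text=True, check=False
    )
    return result.stdout


def wait_for_clean_fsck(
    get_report: Callable[[], str] = run_fsck,
    poll_seconds: float = 60,
    max_polls: int = 60,
) -> bool:
    """Poll until fsck is clean; return False if it never gets there."""
    for attempt in range(max_polls):
        if fsck_is_clean(get_report()):
            return True
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return False
```

With something like this, the next disk would be pulled only after
wait_for_clean_fsck() returns True.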

Thanks,
St.Ack

Re: How to remove three disks from three different nodes in a ten node cluster in less than an hour without losing replicas?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.

It sounds like what you would like is a way to decommission just one
storage directory on the DataNode. We don't currently support that.

You might be able to get something approaching this result with
"chmod 000 $storage_directory_root".  That would at least prevent new
blocks from being created on the disk which you don't trust any more.  It
would also cause the existing blocks to be re-replicated when the
DirectoryScanner re-ran and noticed it couldn't get to them.  Note that I
haven't actually tested the chmod solution, though, so your mileage may vary.
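
As a rough sketch of the chmod idea, the snippet below strips all permission
bits from a storage directory and remembers the old mode so it can be put
back after the disk is swapped. The function names are hypothetical, a POSIX
filesystem is assumed, and, like the suggestion itself, this has not been
tested against a live DataNode, so whether re-replication actually kicks in
is unverified.

```python
import os
import stat


def revoke_access(storage_dir: str) -> int:
    """Equivalent of `chmod 000 storage_dir`; returns the previous mode bits."""
    previous = stat.S_IMODE(os.stat(storage_dir).st_mode)
    os.chmod(storage_dir, 0)
    return previous


def restore_access(storage_dir: str, mode: int) -> None:
    """Put back the permission bits saved by revoke_access()."""
    os.chmod(storage_dir, mode)
```

Note that a process running as root ignores these permission bits, so this
would only have a chance of working if the DataNode runs as an unprivileged
user.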

best,
Colin
