Posted to hdfs-user@hadoop.apache.org by Jonathan Disher <jd...@parad.net> on 2011/01/03 04:20:15 UTC

DataNode internal balancing, performance recommendations

I see that there was a thread on this in December, but I can't retrieve it to reply properly, oh well.

So, I have a 30 node cluster (plus separate namenode, jobtracker, etc.).  Each is a 12-disk machine: two mirrored 250GB OS disks and ten 1TB data disks in JBOD.  The original system config was six 1TB data disks - we added the last four months later.  As you can probably guess, we have some interesting internal usage-balancing issues on most of the nodes.  To date, when individual disks get critically low on space (earlier this week I had a node with six disks around 97% full and four around 70%), we've been pulling the affected nodes from the cluster, formatting their data disks, and sticking them back in (with a rebalance running to keep the cluster in some semblance of order).
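
For context, the rebalance mentioned here is typically the stock HDFS balancer, invoked roughly like this on an 0.20-era cluster (the threshold is the allowed per-datanode deviation from the cluster-average utilization, in percent, and 10 is just the default):

    # Run from any node with the Hadoop client configured.
    # Moves blocks between datanodes until every node's utilization is
    # within the threshold (in percent) of the cluster average.
    bin/hadoop balancer -threshold 10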

Obviously, if there is a better way to do this, I'd love to hear it.  I see recommendations to kill the DataNode process and manually move block files, but my concern is that the DataNode will spend an enormous amount of time tracking down those moves (we're currently at around 820,000 blocks per node).  It's also not easy to automate, so there's a real danger of nuking blocks and making the problem worse.  Are there alternatives to manual moves (or more automated approaches)?  Or does my brute-force rebalance have the best chance of success, albeit slowly?

We are also building a new cluster - starting around 1.2PB raw, eventually growing to around 5PB, for near-line storage of data.  Our storage nodes will probably be 4U systems with 72 data disks each (yeah, good times).  The problem with this becomes obvious: with the way Hadoop works today, if a disk fails, the datanode process chokes and dies when it tries to write to it.  We've been told repeatedly that Hadoop doesn't perform well on RAID arrays, but to scale effectively we're going to have to do just that - three 24-disk controllers in RAID-6 mode.  How bad is this going to be?  JBOD just doesn't scale beyond a couple of disks per machine: the failure rate will knock machines out of the cluster too often (and at 60TB per node, rebalancing will take forever, even if I let it saturate gigabit).
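
To make the two layouts concrete, here is a sketch of how the data directories would differ in hdfs-site.xml - the paths are invented and the property name is the 0.20.x one (dfs.data.dir):

    <!-- hdfs-site.xml: hypothetical layout for one 72-disk node.
         RAID-6 option shown: one directory per 24-disk array.
         A JBOD layout would instead list one directory per spindle,
         e.g. /data/1/dfs,/data/2/dfs,...,/data/72/dfs -->
    <property>
      <name>dfs.data.dir</name>
      <value>/data/array1/dfs,/data/array2/dfs,/data/array3/dfs</value>
    </property>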

I appreciate opinions and suggestions.  Thanks!

-j

Re: DataNode internal balancing, performance recommendations

Posted by Eli Collins <el...@cloudera.com>.
On Mon, Jan 3, 2011 at 11:55 AM, Jonathan Disher <jd...@parad.net> wrote:
> The problem is, what do you define as a failure?  If the disk is failing, writes to the filesystem will fail - how does Hadoop differentiate between a permissions problem and a physical disk failure?  Both return errors.
>

Anything that prevents the volume (mount) from being read or written.
Any failure to write to the volume is considered a failure to use the
volume. Since HDFS doesn't support read-only volumes (e.g. it can't
handle a mount that can be read but not written), these all count as
failures and will cause the volume to be taken offline.

> And yeah, the idea of stopping the datanode, removing the affected mount from hdfs-site.xml, and restarting has been discussed.  The problem is, when that disk gets replaced and re-added, I have horrible internal balance issues - which causes the problem I have now :(

What's the particular issue?  Having an unbalanced set of local disks
should at worst be a performance problem. HDFS doesn't write blocks to
full volumes; it will just start using the other disks.
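
As an aside, there is a per-volume headroom knob that makes the DN back
off from a disk before the filesystem is actually full; a sketch for
hdfs-site.xml, where the 10 GB figure is just an example value:

    <!-- Reserve space per data volume for non-HDFS use, so block
         allocation backs off before a disk is completely full.
         10737418240 bytes = 10 GB, an arbitrary example. -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
    </property>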

Thanks,
Eli

> -j
>

Re: DataNode internal balancing, performance recommendations

Posted by Eli Collins <el...@cloudera.com>.
On Mon, Jan 3, 2011 at 10:29 PM, Jonathan Disher <jd...@parad.net> wrote:
> That's what we've been doing.  Again, the problem is, we still have to pull
> the datanode out of rotation, change the config, replace the disk, and put it
> back... even if I have spares on hand and finish in a few minutes, I still
> have one empty disk and many tens of not-empty disks.

Aside from performance, is there another issue?  Ideally, of course, the
new disks would automatically get re-balanced, and you could
rate-limit the transfers to limit the impact on the machine.
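
For comparison, the cluster-level balancer already has a throttle of this
sort - a sketch, where the 10 MB/s value is only an example (this limits
balancer traffic between datanodes, not any intra-node copy):

    <!-- hdfs-site.xml, read by each datanode at startup: cap the
         bandwidth spent on balancer block transfers, in bytes/sec.
         10485760 = 10 MB/s; the default is 1 MB/s. -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value>
    </property>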

> Monitoring and identifying
> the failure isn't the problem; we have that down pat.  I'm hoping for a
> better way to re-balance the disks in the node after a failure.  I suspect
> the sad answer is that what I'm doing now is the best thing for it.

HDFS-1312 tracks re-balancing disks within a datanode. Currently
people re-balance the directories manually while the datanode is
powered off (datanodes don't care which blocks reside in which volumes,
so you can safely rebalance by hand).
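
To make that concrete, a rough sketch of the offline shuffle - the mount
paths are invented and the blk_* layout under <data.dir>/current/ is the
0.20-style one, so check your own dfs.data.dir structure before trying
anything like this:

    #!/bin/sh
    # Sketch of an offline intra-node rebalance - stop the datanode
    # first and never move block files while it is running.
    FULL=/data/1/dfs/current    # nearly-full volume (example path)
    EMPTY=/data/10/dfs/current  # emptier volume (example path)

    # bin/hadoop-daemon.sh stop datanode

    # Move each block file together with its blk_<id>_<genstamp>.meta
    # checksum file; the DN rescans its volumes on startup, so it does
    # not matter which volume a block ends up on.  Blocks may also live
    # in subdir* directories under current/; this only touches the top
    # level.
    for blk in "$FULL"/blk_*[0-9]; do
      [ -e "$blk" ] || continue
      mv "$blk" "$blk"_*.meta "$EMPTY"/ || break
    done

    # bin/hadoop-daemon.sh start datanode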

Thanks,
Eli

> -j

Re: DataNode internal balancing, performance recommendations

Posted by Jonathan Disher <jd...@parad.net>.
That's what we've been doing.  Again, the problem is, we still have to pull the datanode out of rotation, change the config, replace the disk, and put it back... even if I have spares on hand and finish in a few minutes, I still have one empty disk and many tens of not-empty disks.  Monitoring and identifying the failure isn't the problem; we have that down pat.  I'm hoping for a better way to re-balance the disks in the node after a failure.  I suspect the sad answer is that what I'm doing now is the best thing for it.

-j

On Jan 3, 2011, at 10:21 PM, Esteban Gutierrez Moguel wrote:

> 
> Jonathan,
> 
> Hadoop will throw an exception according to the kind of error: AccessControlException if it's permission-related, or IOException for any other disk-related problem.
> 
> A safer approach to handling physical failures would be to monitor syslog messages (Syslog4j, Nagios, Ganglia, etc.), and if you are lucky and the node doesn't hang after the disk failure, you can shut it down gracefully.
> 
> esteban.
> 

Re: DataNode internal balancing, performance recommendations

Posted by Esteban Gutierrez Moguel <es...@gmail.com>.
Jonathan,

Hadoop will throw an exception according to the kind of error:
AccessControlException if it's permission-related, or IOException for any
other disk-related problem.

A safer approach to handling physical failures would be to monitor syslog
messages (Syslog4j, Nagios, Ganglia, etc.), and if you are lucky and the
node doesn't hang after the disk failure, you can shut it down
gracefully.

esteban.

On Mon, Jan 3, 2011 at 13:55, Jonathan Disher <jd...@parad.net> wrote:

> The problem is, what do you define as a failure?  If the disk is failing,
> writes to the filesystem will fail - how does Hadoop differentiate between
> a permissions problem and a physical disk failure?  Both return errors.
>
> And yeah, the idea of stopping the datanode, removing the affected mount
> from hdfs-site.xml, and restarting has been discussed.  The problem is, when
> that disk gets replaced and re-added, I have horrible internal balance
> issues - which causes the problem I have now :(
>
> -j
>
> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>
> > Hey Jonathan,
> >
> > There's an option (dfs.datanode.failed.volumes.tolerated, introduced
> > in HDFS-1161) that allows you to specify the number of volumes that
> > are allowed to fail before a datanode stops offering service.
> >
> > There's an operational issue that still needs to be addressed
> > (HDFS-1158) that you should be aware of - the DN will still not start
> > if any of the volumes have failed, so to restart the DN you'll need
> > you'll need to either unconfigure the failed volumes or fix them. I'd
> > like to make DN startup respect the config value so it tolerates
> > failed volumes on startup as well.
> >
> > Thanks,
> > Eli
> >
> > On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jd...@parad.net>
> wrote:
> >> I see that there was a thread on this in December, but I can't retrieve
> it to reply properly, oh well.
> >>
> >> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).
>  Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks
> in JBOD.  Original system config was six 1TB data disks - we added the last
> four disks months later.  I'm sure you can all guess, we have some
> interesting internal usage balancing issues on most of the nodes.  To date,
> when individual disks get critically low on space (earlier this week I had a
> node with six disks around 97% full, four around 70%), we've been pulling
> them from the cluster, formatting the data disks, and sticking them back in
> (with a rebalance running to keep the cluster in some semblance of order).
> >>
> >> Obviously if there was a better way to do this, I'd love to see it.  I
> see that there are recommendations of killing the DataNode process and
> manually moving files, but my concern is that the DataNode process will
> spend an enormous amount of time tracking down these moves (currently around
> 820,000 blocks/node).  And it's not necessarily easy to automate, so there's
> the danger of nuking blocks, and making the problems worse.  Are there
> alternatives to manual moves (or more automated ways that exist)?  Or has my
> brute-force rebalance got the best chance of success, albeit slowly?
> >>
> >> We are also building a new cluster - starting around 1.2PB raw,
> eventually growing to around 5PB, for near-line storage of data.  Our
> storage nodes will probably be 4U systems with 72 data disks each (yeah,
> good times).  The problem with this becomes obvious - with the way Hadoop
> works today, if a disk fails, the datanode process chokes and dies when it
> tries to write to it.  We've been told repeatedly that Hadoop doesn't
> perform well when it operates on RAID arrays, but, to scale efffectively,
> we're going to have to do just that - three 24 disk controllers in RAID-6
> mode.  How bad is this going to be?  JBOD just doesn't scale beyond a couple
> disks per machine, the failure rate will knock machines out of the cluster
> too often (and at 60TB per node, rebalancing will take forever, even if I
> let it saturate gigabit).
> >>
> >> I appreciate opinions and suggestions.  Thanks!
> >>
> >> -j
>
>

Re: DataNode internal balancing, performance recommendations

Posted by Jonathan Disher <jd...@parad.net>.
The problem is, what do you define as a failure?  If the disk is failing, writes to the filesystem will fail - how does Hadoop differentiate between a permissions problem and a physical disk failure?  Both return errors.

And yeah, the idea of stopping the datanode, removing the affected mount from hdfs-site.xml, and restarting has been discussed.  The problem is, when that disk gets replaced and re-added, I have horrible internal balance issues - which causes the problem I have now :(

-j

On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:

> Hey Jonathan,
> 
> There's an option (dfs.datanode.failed.volumes.tolerated, introduced
> in HDFS-1161) that allows you to specify the number of volumes that
> are allowed to fail before a datanode stops offering service.
> 
> There's an operational issue that still needs to be addressed
> (HDFS-1158) that you should be aware of - the DN will still not start
> if any of the volumes have failed, so to restart the DN you'll need
> to either unconfigure the failed volumes or fix them. I'd
> like to make DN startup respect the config value so it tolerates
> failed volumes on startup as well.
> 
> Thanks,
> Eli
> 
> On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jd...@parad.net> wrote:
>> I see that there was a thread on this in December, but I can't retrieve it to reply properly, oh well.
>> 
>> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).  Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks in JBOD.  Original system config was six 1TB data disks - we added the last four disks months later.  I'm sure you can all guess, we have some interesting internal usage balancing issues on most of the nodes.  To date, when individual disks get critically low on space (earlier this week I had a node with six disks around 97% full, four around 70%), we've been pulling them from the cluster, formatting the data disks, and sticking them back in (with a rebalance running to keep the cluster in some semblance of order).
>> 
>> Obviously if there was a better way to do this, I'd love to see it.  I see that there are recommendations of killing the DataNode process and manually moving files, but my concern is that the DataNode process will spend an enormous amount of time tracking down these moves (currently around 820,000 blocks/node).  And it's not necessarily easy to automate, so there's the danger of nuking blocks, and making the problems worse.  Are there alternatives to manual moves (or more automated ways that exist)?  Or has my brute-force rebalance got the best chance of success, albeit slowly?
>> 
>> We are also building a new cluster - starting around 1.2PB raw, eventually growing to around 5PB, for near-line storage of data.  Our storage nodes will probably be 4U systems with 72 data disks each (yeah, good times).  The problem with this becomes obvious - with the way Hadoop works today, if a disk fails, the datanode process chokes and dies when it tries to write to it.  We've been told repeatedly that Hadoop doesn't perform well when it operates on RAID arrays, but, to scale efffectively, we're going to have to do just that - three 24 disk controllers in RAID-6 mode.  How bad is this going to be?  JBOD just doesn't scale beyond a couple disks per machine, the failure rate will knock machines out of the cluster too often (and at 60TB per node, rebalancing will take forever, even if I let it saturate gigabit).
>> 
>> I appreciate opinions and suggestions.  Thanks!
>> 
>> -j


Re: DataNode internal balancing, performance recommendations

Posted by Eli Collins <el...@cloudera.com>.
Hey Jonathan,

There's an option (dfs.datanode.failed.volumes.tolerated, introduced
in HDFS-1161) that allows you to specify the number of volumes that
are allowed to fail before a datanode stops offering service.
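
For reference, a sketch of the setting in hdfs-site.xml - the value of 2
is just an example (the default is 0, i.e. any volume failure takes the
datanode down):

    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>2</value>
    </property>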

There's an operational issue that still needs to be addressed
(HDFS-1158) that you should be aware of - the DN will still not start
if any of the volumes have failed, so to restart the DN you'll need
to either unconfigure the failed volumes or fix them. I'd
like to make DN startup respect the config value so it tolerates
failed volumes on startup as well.

Thanks,
Eli

On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jd...@parad.net> wrote:
> I see that there was a thread on this in December, but I can't retrieve it to reply properly, oh well.
>
> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).  Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks in JBOD.  Original system config was six 1TB data disks - we added the last four disks months later.  I'm sure you can all guess, we have some interesting internal usage balancing issues on most of the nodes.  To date, when individual disks get critically low on space (earlier this week I had a node with six disks around 97% full, four around 70%), we've been pulling them from the cluster, formatting the data disks, and sticking them back in (with a rebalance running to keep the cluster in some semblance of order).
>
> Obviously if there was a better way to do this, I'd love to see it.  I see that there are recommendations of killing the DataNode process and manually moving files, but my concern is that the DataNode process will spend an enormous amount of time tracking down these moves (currently around 820,000 blocks/node).  And it's not necessarily easy to automate, so there's the danger of nuking blocks, and making the problems worse.  Are there alternatives to manual moves (or more automated ways that exist)?  Or has my brute-force rebalance got the best chance of success, albeit slowly?
>
> We are also building a new cluster - starting around 1.2PB raw, eventually growing to around 5PB, for near-line storage of data.  Our storage nodes will probably be 4U systems with 72 data disks each (yeah, good times).  The problem with this becomes obvious - with the way Hadoop works today, if a disk fails, the datanode process chokes and dies when it tries to write to it.  We've been told repeatedly that Hadoop doesn't perform well when it operates on RAID arrays, but, to scale efffectively, we're going to have to do just that - three 24 disk controllers in RAID-6 mode.  How bad is this going to be?  JBOD just doesn't scale beyond a couple disks per machine, the failure rate will knock machines out of the cluster too often (and at 60TB per node, rebalancing will take forever, even if I let it saturate gigabit).
>
> I appreciate opinions and suggestions.  Thanks!
>
> -j