Posted to general@hadoop.apache.org by Oded Rosen <od...@legolas-media.com> on 2011/08/10 11:22:54 UTC

Dedicated disk for operating system

Hi,
What is the best practice regarding disk allocation on hadoop data nodes?
We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos).
Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks?

Thanks,

Oded Rosen


Re: Dedicated disk for operating system

Posted by Allen Wittenauer <aw...@apache.org>.
On Aug 10, 2011, at 7:56 AM, Evert Lammerts wrote:

> A short, slightly off-topic question:
> 
>>      Also note that in this configuration that one cannot take
>> advantage of the "keep the machine up at all costs" features in newer
>> Hadoop's, which require that root, swap, and the log area be mirrored
>> to be truly effective.  I'm not quite convinced that those features are
>> worth it yet for anything smaller than maybe a 12 disk config.
> 
> Dell and Cloudera promote the C2100. I'd like to see the calculations behind that config.

	If Dell is shipping the same box they shipped us to test a few months ago, the performance was pretty horrid vs. almost all their competitors.  The main problem was the controller--it was built for RAID, not for JBOD.  (... and then there is the OOB support...)


> Am I wrong thinking that keeping your cluster up with such dense nodes will only work if you have many (order of magnitude 100+) of them, and interconnected with 10Gb Ethernet? If you don't then recovery times from failing disks / rack switches are going to get crazy, right?

	If one assumes that a bunch of nodes are failing at once, yes.  The irony is that ops teams tend to group repairs, so keeping them up might actually be the wrong thing in relation to actual practice.

> If you want to get bang for buck, don't the proportions "disk IO / processing power", "node storage capacity / ethernet speed" and "total amount of nodes / ethernet speed", indicate many small nodes with not too many disks and 1Gb Ethernet?

	The biggest constraint is almost always RAM, as you can use it to help with the rest.


Re: Dedicated disk for operating system

Posted by Ted Dunning <td...@maprtech.com>.
Luke,

Yes, I do have some data to back this up, but I think I mentioned that this
was just a back-of-the-envelope computation.  As such, it necessarily
ignores a number of factors.

Can you say what specifically it is that you object to?  Is the analysis
pessimistic or optimistic?  Are you seeing lots of correlated failures?  I
presume that your 40,000+ nodes are not in a single cluster and thus have
different failure modes than I was talking about.  Perhaps you could say
more about your situation.

In many installations, duty factor is low enough that average failure rate
can be an order of magnitude lower than what I quoted.  Even so, I don't
feel comfortable using that kind of rate for a computation of this sort.

On Wed, Aug 10, 2011 at 12:19 PM, Luke Lu <ll...@vicaya.com> wrote:

> On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <td...@maprtech.com>
> wrote:
> > To be specific, taking a 100 node x 10 disk x 2 TB configuration with
> drive
> > MTBF of 1000 days, we should be seeing drive failures on average once per
> > day....
> > For a 10,000 node cluster, however, we should expect an average disk failure
> > rate of one failure every 2.5 hours.
>
> Do you have real data to back the analysis? You assume a uniform disk
> failure distribution, which is absolutely not true. I can only say
> that our ops data across 40000+ nodes shows that the above analysis is
> not even close. (This is assuming that the ops know what they are
> doing though :)
>
> __Luke
>

Re: Dedicated disk for operating system

Posted by Steve Loughran <st...@apache.org>.
On 10/08/2011 20:31, Brian Bockelman wrote:
> MTTF is a difficult number.  Popular papers include: http://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html, http://labs.google.com/papers/disk_failures.pdf
>
> Ted is assuming a MTTF of 25kHours; I think that's overly pessimistic, although both papers indicate that MTTF is a crappy way to model disk lifetime.
>

see also [Gray05] 
http://research.microsoft.com/apps/pubs/default.aspx?id=64599

Re: Dedicated disk for operating system

Posted by Ted Dunning <td...@maprtech.com>.
Agreed on both points.

Of course, if you are doing a failure analysis it is something of a
professional obligation to be somewhat pessimistic.  For one thing, being
pessimistic on the factors you know about may compensate to some degree for
your inadvertent optimism regarding the factors you don't know about or
think are minor.

In general, assuming stable distributions with finite variance is a bit
dangerous in any case.  On the other hand, doing any real modeling with Lévy
distributions is hard enough that you probably gain less insight for being
more correct.  It is worth incorporating correlated failures into your model
if you can but even that is non-trivial.

On Wed, Aug 10, 2011 at 12:31 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Ted is assuming a MTTF of 25kHours; I think that's overly pessimistic,
> although both papers indicate that MTTF is a crappy way to model disk
> lifetime.
>

Re: Dedicated disk for operating system

Posted by Brian Bockelman <bb...@cse.unl.edu>.
MTTF is a difficult number.  Popular papers include: http://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html, http://labs.google.com/papers/disk_failures.pdf

Ted is assuming an MTTF of 25kHours; I think that's overly pessimistic, although both papers indicate that MTTF is a crappy way to model disk lifetime.

I think a lot has to do with the quality of the batch of hard drives you get and operating conditions.

Brian

On Aug 10, 2011, at 2:19 PM, Luke Lu wrote:

> On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <td...@maprtech.com> wrote:
>> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
>> MTBF of 1000 days, we should be seeing drive failures on average once per
>> day....
>> For a 10,000 node cluster, however, we should expect an average disk failure
>> rate of one failure every 2.5 hours.
> 
> Do you have real data to back the analysis? You assume a uniform disk
> failure distribution, which is absolutely not true. I can only say
> that our ops data across 40000+ nodes shows that the above analysis is
> not even close. (This is assuming that the ops know what they are
> doing though :)
> 
> __Luke


Re: Dedicated disk for operating system

Posted by Luke Lu <ll...@vicaya.com>.
On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <td...@maprtech.com> wrote:
> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
> MTBF of 1000 days, we should be seeing drive failures on average once per
> day....
> For a 10,000 node cluster, however, we should expect an average disk failure
> rate of one failure every 2.5 hours.

Do you have real data to back the analysis? You assume a uniform disk
failure distribution, which is absolutely not true. I can only say
that our ops data across 40000+ nodes shows that the above analysis is
not even close. (This is assuming that the ops know what they are
doing though :)

__Luke

Re: Dedicated disk for operating system

Posted by Steve Loughran <st...@apache.org>.
On 11/08/2011 01:15, Rajiv Chittajallu wrote:
> Ted Dunning wrote on 08/10/11 at 10:40:30 -0700:
>> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
>> MTBF of 1000 days, we should be seeing drive failures on average once per
>> day.  With 1G ethernet and 30MB/s/node dedicated to re-replication, it will
>> take just over 10 minutes to restore replication of a single drive and will
>> take just over 100 minutes to restore replication of an entire machine.
>
> You are assuming that only one good node is used to restore replication for
> all the blocks on the failed drive. Which is very unlikely. With
> replication factor of 3, you will have at least 2 nodes to choose from
> in the worst case and much more in a standard cluster.
>
> And when you are having more spindles, 6+, one would probably consider
> using the second GigE port, which is standard on most of the commodity
> gear out there.
>
>

I'd be willing to collaborate on writing a paper on these issues if 
someone has data they can share; we could look at the observed/predicted 
failure rates of 12x2TB HDD systems, discuss actions (moving blocks 
to any local space would be better than going over the LAN, for example), and 
compare 2x1GbE with 1x10GbE for recovery (ignoring switch cost).

Re: Dedicated disk for operating system

Posted by Ted Dunning <td...@maprtech.com>.
Actually I was assuming that the entire cluster participates in the
rebalancing.  Replication is not done disk-wise in Hadoop but block-wise.
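
For example: with 99 surviving nodes each contributing roughly 30 MB/s, a lost
2 TB drive's blocks get re-copied at about 3 GB/s in aggregate, i.e. in a bit
over 10 minutes, and a whole 20 TB node in a bit over 100 minutes -- which is
where the figures in the quoted message below come from.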

On Wednesday, August 10, 2011, Rajiv Chittajallu <ra...@yahoo-inc.com> wrote:
> Ted Dunning wrote on 08/10/11 at 10:40:30 -0700:
>>To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
>>MTBF of 1000 days, we should be seeing drive failures on average once per
>>day.  With 1G ethernet and 30MB/s/node dedicated to re-replication, it will
>>take just over 10 minutes to restore replication of a single drive and will
>>take just over 100 minutes to restore replication of an entire machine.
>
> You are assuming that only one good node is used to restore replication for
> all the blocks on the failed drive. Which is very unlikely. With
> replication factor of 3, you will have at least 2 nodes to choose from
> in the worst case and much more in a standard cluster.
>
> And when you are having more spindles, 6+, one would probably consider
> using the second GigE port, which is standard on most of the commodity
> gear out there.
>
>
>

Re: Dedicated disk for operating system

Posted by Rajiv Chittajallu <ra...@yahoo-inc.com>.
Ted Dunning wrote on 08/10/11 at 10:40:30 -0700:
>To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
>MTBF of 1000 days, we should be seeing drive failures on average once per
>day.  With 1G ethernet and 30MB/s/node dedicated to re-replication, it will
>take just over 10 minutes to restore replication of a single drive and will
>take just over 100 minutes to restore replication of an entire machine.

You are assuming that only one good node is used to restore replication for 
all the blocks on the failed drive. Which is very unlikely. With
replication factor of 3, you will have at least 2 nodes to choose from
in the worst case and much more in a standard cluster. 

And when you are having more spindles, 6+, one would probably consider
using the second GigE port, which is standard on most of the commodity
gear out there.



Re: Dedicated disk for operating system

Posted by Ted Dunning <td...@maprtech.com>.
The reliability question can be answered to first order by computing
replication time for a unit of storage and then computing how often that
replication time will contain additional failures sufficient to cause data
loss.  Such data loss events should be roughly Poisson distributed with rate
equal to the rate of the original failures times the probability that any
failure actually is a data loss.  Second-order effects appear when one
replication spills into the next, increasing the replication period for the
second event.  It is difficult, if not impossible, to account for all of the
second-order effects in closed form, and I have found it necessary to resort
to discrete event simulation to estimate failure mode probabilities in
detail.  For small numbers of disks per node, one second-order effect that
becomes important is the node failure rate.
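
A toy sketch of that kind of simulation (the drive count, MTBF and
re-replication window below are illustrative assumptions; a real model would
also need correlated and node-level failures):

import random

# Sample drive failures as a Poisson process and count how often two further
# failures land inside the re-replication window of an earlier one -- a
# *candidate* data-loss event with 3-way replication, before accounting for
# whether the failed drives actually share blocks.
def candidate_loss_events_per_year(drives=1000, mtbf_days=1000.0,
                                   window_days=15.0 / (24 * 60),
                                   years=1000, seed=42):
    rng = random.Random(seed)
    rate_per_day = drives / mtbf_days
    horizon_days = years * 365.0
    t, failures = 0.0, []
    while True:
        t += rng.expovariate(rate_per_day)   # exponential inter-arrival gaps
        if t > horizon_days:
            break
        failures.append(t)
    events = 0
    for i, f in enumerate(failures):
        later = failures[i + 1:i + 3]        # only the next two failures matter
        if len(later) == 2 and later[1] - f <= window_days:
            events += 1
    return events / float(years)

print("candidate data-loss events per year: %.4f" % candidate_loss_events_per_year())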

Grouping disks into storage groups, or failing an entire node when one disk
fails, are ways in which the storage unit becomes larger than an individual disk.  Use
of a volume manager or RAID-0 will also increase the storage unit size.

These failure modes drive some limitations on cluster size since the
absolute rate of storage unit failures increases with cluster size.  For a
fixed number of drives in each storage unit, the limiting factor is the
total number of disk drives, not the number of nodes.  For older versions of
Hadoop, the storage unit was all drives on the system, which is quite
dangerous in terms of mean time to data loss.  More recently, a fix has been
committed to trunk (and I think .204; Todd will correct me if I am wrong)
that makes the storage unit equal to a single drive.  In the previous
situation, it was dangerous to have too many drives on each node in large
clusters.  With single disk storage units, the number of drives per machine
does not matter in this computation.

To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
MTBF of 1000 days, we should be seeing drive failures on average once per
day.  With 1G ethernet and 30MB/s/node dedicated to re-replication, it will
take just over 10 minutes to restore replication of a single drive and just
over 100 minutes to restore replication of an entire machine.  The
probability of 2 further disk failures during the 15 minutes after a failure is
roughly \lambda^2 e^{-\lambda} / 2, where \lambda = 15 minutes / 24 hours (about 0.01).
This is a small probability, so average times between data loss should be
relatively long.  For the larger storage unit of 10 disks, the probability
is not so small and data loss should be expected every few years or so.
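
A rough sketch of that back-of-the-envelope arithmetic (it assumes independent
drive failures at a constant rate -- the Poisson assumption questioned
elsewhere in this thread -- and ~30 MB/s of re-replication bandwidth per
surviving node):

import math

nodes, disks_per_node, tb_per_disk = 100, 10, 2
mtbf_days = 1000.0

drives = nodes * disks_per_node
failures_per_day = drives / mtbf_days                 # ~1 failure per day
print("expected drive failures per day: %.2f" % failures_per_day)

rerep_mb_per_s = 30.0 * (nodes - 1)                   # cluster-wide re-replication rate
minutes_per_drive = tb_per_disk * 1e6 / rerep_mb_per_s / 60
print("minutes to re-replicate one drive: %.1f" % minutes_per_drive)
print("minutes to re-replicate one node:  %.1f" % (minutes_per_drive * disks_per_node))

# Probability of two further drive failures inside a ~15 minute window (Poisson).
lam = failures_per_day * 15.0 / (24 * 60)
print("P(2 more failures in 15 min): %.2e" % (lam ** 2 * math.exp(-lam) / 2))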

For a 10,000 node cluster, however, we should expect an average disk failure
rate of one failure every 2.5 hours.  Here, the number of disks
is large enough that the first order computation is much less accurate since
the placement of disk blocks across the cluster will often have more
non-uniformity due to small counts.  This non-uniformity increases the
replication recovery time.  With the large storage unit model, the
probability that three disk failures will stack up becomes unacceptably
large.  Even with the single disk storage unit, the data loss rate becomes
large enough that the cluster cannot be considered archival.

The real question about optimal configuration depends on how fast the
cluster can move data from disk.  If this rate is relatively low compared to
the hardware speeds, then supporting full performance from large numbers of
drives is very difficult.  If you can maintain high transfer rates, however,
you can substantially decrease the cost of your cluster by having fewer
nodes.


On Wed, Aug 10, 2011 at 7:56 AM, Evert Lammerts <Ev...@sara.nl> wrote:

> A short, slightly off-topic question:
>
> >       Also note that in this configuration that one cannot take
> > advantage of the "keep the machine up at all costs" features in newer
> > Hadoop's, which require that root, swap, and the log area be mirrored
> > to be truly effective.  I'm not quite convinced that those features are
> > worth it yet for anything smaller than maybe a 12 disk config.
>
> Dell and Cloudera promote the C2100. I'd like to see the calculations
> behind that config. Am I wrong thinking that keeping your cluster up with
> such dense nodes will only work if you have many (order of magnitude 100+)
> of them, and interconnected with 10Gb Ethernet? If you don't then recovery
> times from failing disks / rack switches are going to get crazy, right? If
> you want to get bang for buck, don't the proportions "disk IO / processing
> power", "node storage capacity / ethernet speed" and "total amount of nodes
> / ethernet speed", indicate many small nodes with not too many disks and 1Gb
> Ethernet?
>
> Cheers,
> Evert
>

Re: Dedicated disk for operating system

Posted by Scott Carey <sc...@richrelevance.com>.

On 8/10/11 7:56 AM, "Evert Lammerts" <Ev...@sara.nl> wrote:

>A short, slightly off-topic question:
>
>>       Also note that in this configuration that one cannot take
>> advantage of the "keep the machine up at all costs" features in newer
>> Hadoop's, which require that root, swap, and the log area be mirrored
>> to be truly effective.  I'm not quite convinced that those features are
>> worth it yet for anything smaller than maybe a 12 disk config.
>
>Dell and Cloudera promote the C2100. I'd like to see the calculations
>behind that config. Am I wrong thinking that keeping your cluster up with
>such dense nodes will only work if you have many (order of magnitude
>100+) of them, and interconnected with 10Gb Ethernet?
>If you don't then recovery times from failing disks / rack switches are
>going to get crazy, right? If you want to get bang for buck, don't the
>proportions "disk IO / processing power", "node storage capacity /
>ethernet speed" and "total amount of nodes / ethernet speed", indicate
>many small nodes with not too many disks and 1Gb Ethernet?

IMO and experience, absolutely.

Get to 40 nodes before you even think about going for high-density
machines with more than 5 drives and the higher-end network infrastructure
they need.

1Gb Ethernet (or 2x1Gb bonded) with smaller machines (1 quad+ core, 4
drives) and 40 nodes will beat a same-total-cost cluster with larger (8+
drives, 2x quad-core CPU, double the RAM) nodes (~18 or so, with more
cores but lower MHz) every time. And your failure scenario (replicating a
40th of the data versus an 18th if a node fails) is better.
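
A rough sketch of that node-loss comparison (node counts, drive counts and the
~100 MB/s of usable 1Gb bandwidth per node are assumptions for illustration,
not benchmarks):

# Minutes to re-replicate one failed node's data, spread over the survivors,
# for two clusters of roughly equal total cost.
def recovery_minutes(nodes, disks_per_node, tb_per_disk, node_mb_per_s=100.0):
    node_mb = disks_per_node * tb_per_disk * 1e6
    aggregate_mb_per_s = node_mb_per_s * (nodes - 1)
    return node_mb / aggregate_mb_per_s / 60

print("40 small nodes (4 x 2TB): %.0f min" % recovery_minutes(40, 4, 2))
print("18 dense nodes (8 x 2TB): %.0f min" % recovery_minutes(18, 8, 2))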

In short, try to get to 50 to 100 nodes before you look at larger
machines.  For larger clusters (200 - thousands) the tradeoffs are very
different -- expensive network infrastructure is required and the
incremental cost of going to 10Gb network isn't as large.

The cost sweet spot for a server goes from ~$4k to ~$10k depending on
cluster size and whether power or space is a larger cost or limiting
factor.


FWIW, Dell r310's are decent small nodes (r410's are popular for more
CPU-heavy workloads, but I'm not a fan due to the larger CPU/disk ratio and
much higher CPU$ / (CPU * MHz) ratio).  Next-gen single-socket, 4-drive
servers with Sandy Bridge Xeon E3-1200 series processors will use even less
power and have about 50% better CPU performance per core than today's
common Nehalem processors used in 2-socket servers, due to much higher
clock speeds and about 15% better performance at the same clock.

The next-gen Intel processors after that (Ivy Bridge) promise another big
MHz jump without a power increase in late 2012 / early 2013.  At that
point, I expect that single-socket, 4- or 6-core machines will be optimal
for a larger range of use cases than now. Per-socket CPU power is
increasing at a faster rate than drive and network performance.


>
>Cheers,
>Evert


RE: Dedicated disk for operating system

Posted by Evert Lammerts <Ev...@sara.nl>.
A short, slightly off-topic question:

>       Also note that in this configuration that one cannot take
> advantage of the "keep the machine up at all costs" features in newer
> Hadoop's, which require that root, swap, and the log area be mirrored
> to be truly effective.  I'm not quite convinced that those features are
> worth it yet for anything smaller than maybe a 12 disk config.

Dell and Cloudera promote the C2100. I'd like to see the calculations behind that config. Am I wrong thinking that keeping your cluster up with such dense nodes will only work if you have many (order of magnitude 100+) of them, and interconnected with 10Gb Ethernet? If you don't then recovery times from failing disks / rack switches are going to get crazy, right? If you want to get bang for buck, don't the proportions "disk IO / processing power", "node storage capacity / ethernet speed" and "total amount of nodes / ethernet speed", indicate many small nodes with not too many disks and 1Gb Ethernet?

Cheers,
Evert

RE: Dedicated disk for operating system

Posted by Oded Rosen <od...@legolas-media.com>.
Thanks,
This is helpful

-----Original Message-----
From: Allen Wittenauer [mailto:aw@apache.org] 
Sent: Wednesday, August 10, 2011 4:50 PM
To: general@hadoop.apache.org
Subject: Re: Dedicated disk for operating system


On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote:

> Hi,
> What is the best practice regarding disk allocation on hadoop data nodes?
> We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos).
> Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks?


	It's a waste to put the OS on a separate disk.  Every spindle = performance, esp for MR spills.

	I'm currently configuring:

disk 1 - os, swap, app area, MR spill space, HDFS space
disk 2 through n - swap, MR spill space, HDFS space

	The usual reason people say to put the OS on a separate space is to make upgrades easier as you won't have to touch the application.  The reality is that you're going to blow away the entire machine during an upgrade anyway.  So don't worry about this situation. 

	I know a lot of people combine the MR spill space and HDFS space onto the same partition, but I've found that keeping them separate has  two advantages:

	* No longer have to deal with the stupid math that HDFS uses for reservation--no question as to how much space one actually has
	* A hard limit on MR space kills badly written jobs before they eat up enough space to nuke HDFS

	Of course, the big disadvantage is one needs to calculate the correct space needed, and that's a toughie.  But if you know your applications then not a problem.  Besides, if one gets it wrong, you can always do a rolling re-install to fix it.

	Also note that in this configuration that one cannot take advantage of the "keep the machine up at all costs" features in newer Hadoop's, which require that root, swap, and the log area be mirrored to be truly effective.  I'm not quite convinced that those features are worth it yet for anything smaller than maybe a 12 disk config.

Re: Dedicated disk for operating system

Posted by Scott Carey <sc...@richrelevance.com>.

On 8/10/11 6:50 AM, "Allen Wittenauer" <aw...@apache.org> wrote:

>
>On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote:
>
>> Hi,
>> What is the best practice regarding disk allocation on hadoop data
>>nodes?
>> We plan on having multiple storage disks per node, and we want to know
>>if we should save a smaller, separate disk for the os (centos).
>> Is it the suggested configuration, or is it ok to let the OS reside on
>>one of the HDFS storage disks?
>
>
>	It's a waste to put the OS disk on a separate disk.  Every spindle =
>performance, esp for MR spills.
>
>	I'm currently configuring:
>
>disk 1 - os, swap, app area, MR spill space, HDFS space
>disk 2 through n - swap, MR spill space, HDFS space

We do something similar, except that disk 1 does not have MR spill space.
Disk 1 is OS, logs, swap, app area, HDFS.  Disk 2 is MR spill/temp, HDFS.
Also, we put the HDFS partitions in the 'front' of the disk where
sequential transfers are faster, and the other stuff at the end.

>
>	The usual reason people say to put the OS on a separate space is to make
>upgrades easier as you won't have to touch the application.  The reality
>is that you're going to blow away the entire machine during an upgrade
>anyway.  So don't worry about this situation.
>
>	I know a lot of people combine the MR spill space and HDFS space onto
>the same partition, but I've found that keeping them separate has  two
>advantages:
>
>	* No longer have to deal with the stupid math that HDFS uses for
>reservation--no question as to how much space one actually has
>	* A hard limit on MR space kills badly written jobs before they eat up
>enough space to nuke HDFS

Furthermore, the disk performance is MUCH better if you split them and
optimize the file system and mount parameters for the different workloads.
M/R spill in the same place as HDFS was causing a lot of random seeks for
us and throttling HDFS performance.
*  HDFS needs mostly sequential write and read optimized file system and
mount parameters (and some random read).  It is also not metadata heavy.
We found that XFS worked very well for this, and has an online
defragmenter we use to keep that partition in good shape.  We are not disk
I/O bound in HDFS with 4 drives/server this way. Ext4 is an option too,
but has no online defragmenter.  Ext3 gets really fragmented after a while,
causing the system to become I/O bound more regularly as the node ages.  Ext4
should be much better at avoiding fragmentation than ext3.
*  M/R spill is metadata intensive, with many small reads and writes in
addition to larger writes and reads and files that come and go.  We found
that using ext4 for this, with optimized mount parameters
(rw,noatime,nobarrier,data=writeback,commit=30) tremendously reduced I/O
for M/R temp as many files didn't even live 30 seconds to get flushed to
disk.  These settings are not appropriate for the HDFS partition.  XFS is
a horrible option for M/R spill and temp -- it performs very poorly with
those workloads.
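
If it helps, here is a small sanity-check sketch of that split (the mount
points /data/hdfs and /data/mapred-temp and the option lists are hypothetical
-- substitute your own layout):

# Read /proc/mounts and flag partitions whose options drift from the intended
# profiles described above.
WANT = {
    "/data/hdfs":        {"noatime"},                                # XFS, mostly sequential I/O
    "/data/mapred-temp": {"noatime", "nobarrier", "data=writeback"}, # ext4, metadata-heavy temp
}

def mount_options(path="/proc/mounts"):
    opts = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4:
                device, mountpoint, fstype, options = fields[:4]
                opts[mountpoint] = (fstype, set(options.split(",")))
    return opts

if __name__ == "__main__":
    mounts = mount_options()
    for mp, wanted in sorted(WANT.items()):
        if mp not in mounts:
            print("%s: not mounted" % mp)
            continue
        fstype, have = mounts[mp]
        missing = ",".join(sorted(wanted - have))
        print("%s (%s): %s" % (mp, fstype, "OK" if not missing else "missing " + missing))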

>
>	Of course, the big disadvantage is one needs to calculate the correct
>space needed, and that's a toughie.  But if you know your applications
>then not a problem.  Besides, if one gets it wrong, you can always do a
>rolling re-install to fix it.
>
>	Also note that in this configuration that one cannot take advantage of
>the "keep the machine up at all costs" features in newer Hadoop's, which
>require that root, swap, and the log area be mirrored to be truly
>effective.  I'm not quite convinced that those features are worth it yet
>for anything smaller than maybe a 12 disk config.


Re: Dedicated disk for operating system

Posted by Allen Wittenauer <aw...@apache.org>.
On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote:

> Hi,
> What is the best practice regarding disk allocation on hadoop data nodes?
> We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos).
> Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks?


	It's a waste to put the OS on a separate disk.  Every spindle = performance, esp for MR spills.

	I'm currently configuring:

disk 1 - os, swap, app area, MR spill space, HDFS space
disk 2 through n - swap, MR spill space, HDFS space
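
For reference, a rough sketch of how that layout maps onto the 0.20-era
settings -- dfs.data.dir and mapred.local.dir each take a comma-separated list
of directories, one per mount point (the /data/N paths are hypothetical):

# Build the comma-separated directory lists from per-disk mount points.
disks = ["/data/%d" % i for i in range(1, 7)]            # /data/1 .. /data/6

# HDFS block storage gets a directory on every disk, including the OS disk.
dfs_data_dir = ",".join(d + "/hdfs" for d in disks)
# MR spill space on every disk in this layout; drop disks[0] if you prefer to
# keep spill off the OS disk, as Scott describes in his reply.
mapred_local_dir = ",".join(d + "/mapred" for d in disks)

print("dfs.data.dir     = %s" % dfs_data_dir)
print("mapred.local.dir = %s" % mapred_local_dir)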

	The usual reason people say to put the OS on a separate space is to make upgrades easier as you won't have to touch the application.  The reality is that you're going to blow away the entire machine during an upgrade anyway.  So don't worry about this situation. 

	I know a lot of people combine the MR spill space and HDFS space onto the same partition, but I've found that keeping them separate has  two advantages:

	* No longer have to deal with the stupid math that HDFS uses for reservation--no question as to how much space one actually has
	* A hard limit on MR space kills badly written jobs before they eat up enough space to nuke HDFS

	Of course, the big disadvantage is one needs to calculate the correct space needed, and that's a toughie.  But if you know your applications then not a problem.  Besides, if one gets it wrong, you can always do a rolling re-install to fix it.

	Also note that in this configuration that one cannot take advantage of the "keep the machine up at all costs" features in newer Hadoop's, which require that root, swap, and the log area be mirrored to be truly effective.  I'm not quite convinced that those features are worth it yet for anything smaller than maybe a 12 disk config.