Posted to common-user@hadoop.apache.org by David Ritch <da...@gmail.com> on 2009/03/05 17:20:01 UTC

System Layout Best Practices

Are there any published guidelines on system configuration for Hadoop?

I've seen hardware suggestions, but I'm really interested in recommendations
on disk layout and partitioning.  The defaults, as shipped and defined in
hadoop-default.xml, may be appropriate for testing, but are not really
appropriate for sustained use.  For example, data and metadata are both
stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
the NameNode can generate 3-5GB of logs per day.  If you configure your
namenode host badly, it's easy to fill up the partition used by DFS for
metadata and clobber your DFS filesystem.  I would think that setting the log
threshold to WARN, rather than INFO, would be preferable.
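
For concreteness, what I have in mind is something along these lines in
conf/hadoop-site.xml, overriding the /tmp defaults.  The mount points are just
placeholders for whatever dedicated partitions a site chooses, not a
recommendation:

  <configuration>
    <!-- Keep NameNode metadata off /tmp; a comma-separated list keeps redundant copies. -->
    <property>
      <name>dfs.name.dir</name>
      <value>/srv/hadoop/name,/mnt/backup/hadoop/name</value>
    </property>
    <!-- One DataNode block directory per physical disk. -->
    <property>
      <name>dfs.data.dir</name>
      <value>/data/1/dfs,/data/2/dfs,/data/3/dfs</value>
    </property>
    <!-- Move the catch-all temp directory too, so nothing silently defaults back to /tmp. -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/1/hadoop-tmp</value>
    </property>
  </configuration>

My assumption is that the log threshold can be raised by changing the INFO
default in conf/log4j.properties (or HADOOP_ROOT_LOGGER in hadoop-env.sh) to
WARN, but I haven't verified how much that helps on a large cluster.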

On a datanode, we would like to reserve as much space as we can for data,
but we know that map-reduce jobs need some local storage.  How do people
generally estimate the amount of space required for temporary storage?  I
would assume that it would be good to partition it from data storage, to
prevent running out of temp space on some nodes.  I would also think that it
would be preferable for performance to have temp space on a different
spindle, so it and hdfs data can be accessed independently.
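
What I was considering, as a sketch, is giving map-reduce its own partitions
and spindles and reserving a cushion on the data disks.  The paths and sizes
below are guesses on my part, not measured numbers:

  <!-- conf/hadoop-site.xml: separate MapReduce scratch space from HDFS block storage. -->
  <property>
    <name>mapred.local.dir</name>
    <value>/scratch/1/mapred,/scratch/2/mapred</value>
    <!-- Intermediate map output goes here, on spindles not used by dfs.data.dir. -->
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>
    <!-- Reserve roughly 10 GB per data volume for non-DFS use. -->
  </property>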

I would be interested to know how other sites configure their systems, and I
would love to see some guidelines for system configuration for Hadoop.

Thank you!

David

Re: System Layout Best Practices

Posted by David Ritch <da...@gmail.com>.
Thank you - that certainly is useful, and I would love to see more
information and discussion on that sort of thing.  However, I'm also looking
for some lower-level configuration, such as disk partitioning.

David

On Thu, Mar 5, 2009 at 11:36 AM, Sandy <sn...@gmail.com> wrote:

> Hi David,
>
> I don't know if you've seen this already, but this might be of some help:
> http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html
>
> Near the bottom, there is a section called "Real-World Cluster
> Configurations" with some sample configuration parameters that were used to
> run a very large sort benchmark.
>
> All the best,
> -SM
>
> On Thu, Mar 5, 2009 at 10:20 AM, David Ritch <da...@gmail.com>
> wrote:
>
> > Are there any published guidelines on system configuration for Hadoop?
> >
> > I've seen hardware suggestions, but I'm really interested in recommendations
> > on disk layout and partitioning.  The defaults, as shipped and defined in
> > hadoop-default.xml, may be appropriate for testing, but are not really
> > appropriate for sustained use.  For example, data and metadata are both
> > stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
> > the NameNode can generate 3-5GB of logs per day.  If you configure your
> > namenode host badly, it's easy to fill up the partition used by DFS for
> > metadata and clobber your DFS filesystem.  I would think that setting the
> > log threshold to WARN, rather than INFO, would be preferable.
> >
> > On a datanode, we would like to reserve as much space as we can for data,
> > but we know that map-reduce jobs need some local storage.  How do people
> > generally estimate the amount of space required for temporary storage?  I
> > would assume that it would be good to partition it from data storage, to
> > prevent running out of temp space on some nodes.  I would also think that it
> > would be preferable for performance to have temp space on a different
> > spindle, so it and hdfs data can be accessed independently.
> >
> > I would be interested to know how other sites configure their systems, and I
> > would love to see some guidelines for system configuration for Hadoop.
> >
> > Thank you!
> >
> > David
> >
>

Re: System Layout Best Practices

Posted by Sandy <sn...@gmail.com>.
Hi David,

I don't know if you've seen this already, but this might be of some help:
http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html

Near the bottom, there is a section called "Real-World Cluster
Configurations" with some sample configuration parameters that were used to
run a very large sort benchmark.
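
For what it's worth, the overrides in that section are of roughly this shape
in conf/hadoop-site.xml; the values here are placeholders I picked for
illustration, so please take the actual numbers from the page itself:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- larger HDFS blocks for big sequential jobs -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>  <!-- bigger sort buffer for map output, given enough task memory -->
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>  <!-- more parallel shuffle fetches on a large cluster -->
  </property>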

All the best,
-SM

On Thu, Mar 5, 2009 at 10:20 AM, David Ritch <da...@gmail.com> wrote:

> Are there any published guidelines on system configuration for Hadoop?
>
> I've seen hardware suggestions, but I'm really interested in recommendations
> on disk layout and partitioning.  The defaults, as shipped and defined in
> hadoop-default.xml, may be appropriate for testing, but are not really
> appropriate for sustained use.  For example, data and metadata are both
> stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
> the NameNode can generate 3-5GB of logs per day.  If you configure your
> namenode host badly, it's easy to fill up the partition used by DFS for
> metadata and clobber your DFS filesystem.  I would think that setting the log
> threshold to WARN, rather than INFO, would be preferable.
>
> On a datanode, we would like to reserve as much space as we can for data,
> but we know that map-reduce jobs need some local storage.  How do people
> generally estimate the amount of space required for temporary storage?  I
> would assume that it would be good to partition it from data storage, to
> prevent running out of temp space on some nodes.  I would also think that it
> would be preferable for performance to have temp space on a different
> spindle, so it and hdfs data can be accessed independently.
>
> I would be interested to know how other sites configure their systems, and I
> would love to see some guidelines for system configuration for Hadoop.
>
> Thank you!
>
> David
>