Posted to common-user@hadoop.apache.org by Todd Troxell <tt...@debian.org> on 2008/04/10 04:42:36 UTC

hdfs > 100T?

Hello list,

I was unable to access the archives for this list as
http://hadoop.apache.org/mail/core-user/ returns 403.

I am interested in using HDFS for storage, and for map/reduce only
tangentially.  I see clusters mentioned in the docs with many many nodes and
9TB of disk.

Is HDFS expected to scale to > 100TB?

Does it require massive parallelism to scale to many files?  For instance, do you
think it would slow down drastically in a 2 node 32T config?

The workload is file serving at about 100 Mbit/s and 20 req/sec.

Any input is appreciated :)

-- 
Todd Troxell
http://rapidpacket.com/~xtat
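
For what it's worth, a read path for that kind of file serving goes through
the HDFS client API. Below is a minimal Java sketch, assuming a 2008-era
(0.16.x) Hadoop install; the namenode URI and buffer size are illustrative,
not taken from any deployment in this thread.

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileServer {
        // Stream one HDFS file to a client-supplied OutputStream
        // (e.g. an HTTP response body).
        public static void serve(String namenodeUri, String file, OutputStream out)
                throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", namenodeUri);  // e.g. "hdfs://namenode:9000" (hypothetical)
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path(file));
            try {
                IOUtils.copyBytes(in, out, 64 * 1024, false);  // 64 KB buffer; caller closes out
            } finally {
                in.close();
            }
        }
    }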

Re: hdfs > 100T?

Posted by Todd Troxell <tt...@debian.org>.
On Thu, Apr 10, 2008 at 09:47:59AM +0200, Mads Toftum wrote:
> On Wed, Apr 09, 2008 at 09:42:36PM -0500, Todd Troxell wrote:
> > I was unable to access the archives for this list as
> > http://hadoop.apache.org/mail/core-user/ returns 403.
> > 
> You're probably looking for
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/

Thank you.  

By the way, this is the page listing the broken link:
http://hadoop.apache.org/core/mailing_lists.html

-- 
Todd Troxell
http://rapidpacket.com/~xtat

Re: hdfs > 100T?

Posted by Mads Toftum <ma...@toftum.dk>.
On Wed, Apr 09, 2008 at 09:42:36PM -0500, Todd Troxell wrote:
> I was unable to access the archives for this list as
> http://hadoop.apache.org/mail/core-user/ returns 403.
> 
You're probably looking for
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/

vh

Mads Toftum
-- 
http://soulfood.dk

Re: hdfs > 100T?

Posted by Ted Dunning <td...@veoh.com>.
I should mention that the generally available mogile is not suitable for
large installs.

We had to make significant changes to get it to work correctly.  We are
figuring out how to contribute these back, but may have to fork the project
to do it.


On 4/10/08 12:21 PM, "Todd Troxell" <tt...@debian.org> wrote:

> On Thu, Apr 10, 2008 at 09:18:02AM -0700, Ted Dunning wrote:
>> Hadoop also does much better with spindles spread across many machines.
>> Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts.
>> Much better to put 0.5-2TB on 16-64 machines.  With 2x1TB SATA drives, your
>> cost and performance are likely to both be better than two machines with
>> storage trays (aggressive pricing right now on minimal machines with 16TB in
>> two storage trays from a major vendor is about 18K$, you should be able to
>> populate a 1U node with 2TB of disk for about $1500.  16 x 1.5K$ = 24K$ <
>> 2x18K$).  The rack space requirements are about the same, but you may have
>> slightly lower power for the tray solutions.
>> 
>> On the other hand, your performance requirements are so low that you might
>> be just as well off getting something like a Sun Thumper that can accommodate
>> all of your storage in a single chassis.
>> 
>> We use a mixture of both kinds of solution in our system.  We have nearly a
>> billion files stored on tray based machines using mogileFS.  One scaling
>> constraint there is simply the management and configuration of nodes, so
>> fewer machines is a small win.  We also have a modest number of TB's in a
>> more traditional hadoop cluster with small machines.
> 
> Thanks for the input.  I had been considering mogilefs as well until
> recently, but I don't have enough machines to do any serious benchmarking
> of configurations.  This post has been very helpful.


Re: hdfs > 100T?

Posted by Todd Troxell <tt...@debian.org>.
On Thu, Apr 10, 2008 at 09:18:02AM -0700, Ted Dunning wrote:
> Hadoop also does much better with spindles spread across many machines.
> Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts.
> Much better to put 0.5-2TB on 16-64 machines.  With 2x1TB SATA drives, your
> cost and performance are likely to both be better than two machines with
> storage trays (aggressive pricing right now on minimal machines with 16TB in
> two storage trays from a major vendor is about 18K$, you should be able to
> populate a 1U node with 2TB of disk for about $1500.  16 x 1.5K$ = 24K$ <
> 2x18K$).  The rack space requirements are about the same, but you may have
> slightly lower power for the tray solutions.
> 
> On the other hand, your performance requirements are so low that you might
> be just as well off getting something like a Sun Thumper that can accommodate
> all of your storage in a single chassis.
> 
> We use a mixture of both kinds of solution in our system.  We have nearly a
> billion files stored on tray based machines using mogileFS.  One scaling
> constraint there is simply the management and configuration of nodes, so
> fewer machines is a small win.  We also have a modest number of TB's in a
> more traditional hadoop cluster with small machines.

Thanks for the input.  I had been considering mogilefs as well until
recently, but I don't have enough machines to do any serious benchmarking
of configurations.  This post has been very helpful.

-- 
Todd Troxell
http://rapidpacket.com/~xtat

Re: RAID-0 vs. JBOD?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Ted Dunning wrote:
> 
> I haven't done a detailed comparison, but I have seen some effects:
> 
> A) raid doesn't usually work really well on low-end machines compared to
> independent drives.  This would make me distrust raid.
> 
> B) hadoop doesn't do very well, historically speaking, with more than one
> partition if the partitions are not roughly equal in size.  Quite frankly,
> it doesn't even do all that well with datanodes that have radically
> different storage availability.
> 
> C) with raid-0, if you lose either drive, you lose both.  With separate
> partitions, you can lose one drive and retain the other.
> 
> These lead to opposite conclusions, so I don't know what to recommend.  If I
> had to choose, I think I would do without RAID.

D) Out of 5 disks, if one of them is slow (not that uncommon), then the
whole RAID will run only as fast as that disk.

On a smaller cluster (<100 nodes), RAID is simpler overall since the
probability of bad disks is lower.

Raghu.
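
To put rough numbers on point D (the per-disk throughput figures below are
made up purely for illustration):

    public class SlowDiskEffect {
        public static void main(String[] args) {
            // Hypothetical per-disk sequential throughput in MB/s;
            // one of the five disks is degraded.
            double[] mbPerSec = { 80, 80, 80, 80, 30 };

            double slowest = Double.MAX_VALUE, sum = 0;
            for (double d : mbPerSec) {
                slowest = Math.min(slowest, d);
                sum += d;
            }

            // RAID-0 stripes every request across all members, so the
            // array runs at roughly N times the slowest disk.
            double raid0 = mbPerSec.length * slowest;  // 5 * 30 = 150 MB/s
            // With JBOD each disk serves its own blocks at its own speed;
            // only the blocks on the slow disk are affected.
            double jbod = sum;                         // 350 MB/s

            System.out.printf("RAID-0 ~%.0f MB/s, JBOD ~%.0f MB/s%n", raid0, jbod);
        }
    }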

Re: RAID-0 vs. JBOD?

Posted by Ted Dunning <td...@veoh.com>.

I haven't done a detailed comparison, but I have seen some effects:

A) raid doesn't usually work really well on low-end machines compared to
independent drives.  This would make me distrust raid.

B) hadoop doesn't do very well, historically speaking, with more than one
partition if the partitions are not roughly equal in size.  Quite frankly,
it doesn't even do all that well with datanodes that have radically
different storage availability.

C) with raid-0, if you lose either drive, you lose both.  With separate
partitions, you can lose one drive and retain the other.

These lead to opposite conclusions, so I don't know what to recommend.  If I
had to choose, I think I would do without RAID.
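
A quick numeric sketch of point C; the per-drive failure rate below is an
assumption chosen only to make the arithmetic concrete:

    public class Raid0LossOdds {
        public static void main(String[] args) {
            double p = 0.05;  // assumed annual failure probability per drive (illustrative)
            int n = 2;        // drives striped together

            // Probability that at least one of the n drives fails in a year.
            double anyFailure = 1 - Math.pow(1 - p, n);  // ~0.098 for p = 0.05, n = 2
            System.out.printf("P(at least one of %d drives fails) = %.3f%n", n, anyFailure);

            // With RAID-0 that event costs the whole volume; with separate
            // partitions it costs only the blocks on the failed drive, which
            // HDFS re-replicates from other datanodes.
        }
    }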




On 4/10/08 10:29 AM, "Colin Evans" <co...@metaweb.com> wrote:

> We're building a cluster of 40 machines with 5 drives each, and I'm
> curious what people's experiences have been for using RAID-0 for HDFS
> vs. configuring separate partitions (JBOD) and having the datanode
> balance between them.
> 
> I took a look at the datanode code, and datanodes appear to write blocks
> using a round-robin algorithm when managing multiple partitions.  In
> theory, the striping on RAID-0 should be more evenly balanced than this,
> but RAID-0 doesn't seem to give a speedup proportionate to the number of
> drives being striped.  Furthermore, our initial tests seem to suggest
> that the JBOD configuration spends less time in wait state than the
> RAID-0 configuration when running disk-bound jobs.
> 
> We're still tweaking our own benchmarks, so we don't have any conclusive
> results yet.  Has anyone done this kind of comparison before?
> 
> 


RAID-0 vs. JBOD?

Posted by Colin Evans <co...@metaweb.com>.
We're building a cluster of 40 machines with 5 drives each, and I'm 
curious what people's experiences have been for using RAID-0 for HDFS 
vs. configuring separate partitions (JBOD) and having the datanode 
balance between them. 

I took a look at the datanode code, and datanodes appear to write blocks 
using a round-robin algorithm when managing multiple partitions.  In 
theory, the striping on RAID-0 should be more evenly balanced than this, 
but RAID-0 doesn't seem to give a speedup proportionate to the number of 
drives being striped.  Furthermore, our initial tests seem to suggest 
that the JBOD configuration spends less time in wait state than the 
RAID-0 configuration when running disk-bound jobs.

We're still tweaking our own benchmarks, so we don't have any conclusive 
results yet.  Has anyone done this kind of comparison before?
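
For illustration, the round-robin placement described above amounts to
something like the simplified sketch below. This is not the actual datanode
code, just the idea; the dfs.data.dir directories are hypothetical examples
of a JBOD layout.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    public class RoundRobinVolumes {
        private final List<File> volumes;  // one data directory per physical disk
        private int next = 0;              // index of the volume for the next block

        public RoundRobinVolumes(List<File> volumes) {
            this.volumes = volumes;
        }

        // Pick the directory for the next block, cycling through the volumes
        // in order regardless of how full or how fast each one is.
        public synchronized File volumeForNextBlock() {
            File v = volumes.get(next);
            next = (next + 1) % volumes.size();
            return v;
        }

        public static void main(String[] args) {
            // JBOD: one directory per disk, e.g. dfs.data.dir set to
            // "/d0/dfs/data,/d1/dfs/data,/d2/dfs/data,/d3/dfs/data,/d4/dfs/data"
            RoundRobinVolumes rr = new RoundRobinVolumes(Arrays.asList(
                    new File("/d0/dfs/data"), new File("/d1/dfs/data"),
                    new File("/d2/dfs/data"), new File("/d3/dfs/data"),
                    new File("/d4/dfs/data")));
            for (int i = 0; i < 7; i++) {
                System.out.println("block " + i + " -> " + rr.volumeForNextBlock());
            }
        }
    }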



Re: hdfs > 100T?

Posted by Ted Dunning <td...@veoh.com>.
Hadoop also does much better with spindles spread across many machines.
Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts.
Much better to put 0.5-2TB on 16-64 machines.  With 2x1TB SATA drives, your
cost and performance are likely to both be better than two machines with
storage trays (aggressive pricing right now on minimal machines with 16TB in
two storage trays from a major vendor is about 18K$, you should be able to
populate a 1U node with 2TB of disk for about $1500.  16 x 1.5K$ = 24K$ <
2x18K$).  The rack space requirements are about the same, but you may have
slightly lower power for the tray solutions.

On the other hand, your performance requirements are so low that you might
be just as well off getting something like a Sun Thumper that can accommodate
all of your storage in a single chassis.

We use a mixture of both kinds of solution in our system.  We have nearly a
billion files stored on tray based machines using mogileFS.  One scaling
constraint there is simply the management and configuration of nodes, so
fewer machines is a small win.  We also have a modest number of TB's in a
more traditional hadoop cluster with small machines.
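
Restating the cost arithmetic as a small worked example (the per-node
figures are the ones quoted above, sized for the 32 TB case in the original
question):

    public class ClusterCostSketch {
        public static void main(String[] args) {
            // ~$1,500 per 1U node with 2 TB of disk, ~$18,000 per machine
            // with 16 TB in two storage trays (figures quoted above).
            int smallNodes = 16, smallNodeCost = 1500;   // 16 x 2 TB = 32 TB
            int trayNodes  = 2,  trayNodeCost  = 18000;  //  2 x 16 TB = 32 TB

            System.out.printf("16 small nodes:  $%,d%n", smallNodes * smallNodeCost); // $24,000
            System.out.printf("2 tray machines: $%,d%n", trayNodes * trayNodeCost);   // $36,000
        }
    }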


On 4/10/08 12:57 AM, "Allen Wittenauer" <aw...@yahoo-inc.com> wrote:

> On 4/10/08 4:42 AM, "Todd Troxell" <tt...@debian.org> wrote:
>> Hello list,
> 
>     Howdy.
> 
>> I am interested in using HDFS for storage, and for map/reduce only
>> tangentially.  I see clusters mentioned in the docs with many many nodes and
>> 9TB of disk.
>> 
>> Is HDFS expected to scale to > 100TB?
> 
>     We're running file systems in the 2-6PB range.
> 
>> Does it require massive parallelism to scale to many files?  For instance, do
>> you think it would slow down drastically in a 2 node 32T config?
> 
>     The biggest gotcha is the name node.  You need to feed it lots and lots
> of memory.  Keep in mind that Hadoop functions better with fewer large files
> than many small ones.
> 
> 


Re: hdfs > 100T?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 4/10/08 4:42 AM, "Todd Troxell" <tt...@debian.org> wrote:
> Hello list,

    Howdy.

> I am interested in using HDFS for storage, and for map/reduce only
> tangentially.  I see clusters mentioned in the docs with many many nodes and
> 9TB of disk.
> 
> Is HDFS expected to scale to > 100TB?

    We're running file systems in the 2-6PB range.

> Does it require massive parallelism to scale to many files?  For instance, do
> you think it would slow down drastically in a 2 node 32T config?

    The biggest gotcha is the name node.  You need to feed it lots and lots
of memory.  Keep in mind that Hadoop functions better with fewer large files
than many small ones.
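
To see why the name node is the gotcha, here is a rough heap estimate using
a commonly quoted rule of thumb of roughly 150 bytes of name node memory per
file, directory, or block; the namespace sizes below are hypothetical:

    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long files  = 10000000L;    // hypothetical: 10 million files
            long blocks = 15000000L;    // hypothetical: ~1.5 blocks per file
            long bytesPerObject = 150;  // rule-of-thumb cost per namespace object

            long heapBytes = (files + blocks) * bytesPerObject;
            System.out.printf("~%.1f GB of name node heap for %d files / %d blocks%n",
                    heapBytes / 1e9, files, blocks);

            // The heap scales with the number of namespace objects, not with
            // the bytes stored, which is why fewer large files beat many
            // small ones.
        }
    }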