Posted to user@hbase.apache.org by Sean Bigdatafun <se...@gmail.com> on 2011/02/01 03:34:17 UTC

Re: HDFS without Hadoop: Why?

I feel this is a great discussion, so let's think of HDFS' customers.

(1) MapReduce --- definitely a perfect fit as Nathan has pointed out
(2) HBase --- it seems HBase (Bigtable's log-structured file design) did a
great job on this. The solution comes out of Google, so it must be right. But
would Google necessarily have chosen this approach for its Bigtable system had
GFS not existed in the first place? I.e., could we have an alternative 'best'
approach?


Anything else? I do not think HDFS is a good file system choice for
enterprise applications.

On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:

> I have a very general question on the usefulness of HDFS for purposes other
> than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very
> popular these days, but the use of HDFS for other purposes (database
> backend, records archiving, etc) confuses me, since there are other free
> distributed filesystems out there (I personally work on Lustre), with
> significantly better general-purpose performance.
>
> So please tell me if I'm wrong about any of this.  Note I've gathered most
> of my info from documentation rather than reading the source code.
>
> As I understand it, HDFS was written specifically for Hadoop compute jobs,
> with the following design factors in mind:
>
>    - write-once-read-many (WORM) access model
>    - use commodity hardware with relatively high failure rates (i.e.
>    failures are expected)
>    - long, sequential streaming data access
>    - large files
>    - hardware/OS agnostic
>    - moving computation is cheaper than moving data
>
>
> While appropriate for processing many large-input Hadoop data-processing
> jobs, there are significant penalties to be paid when trying to use these
> design factors for more general-purpose storage:
>
>    - Commodity hardware requires data replication for safety.  The HDFS
>    implementation has three penalties: storage redundancy, network loading, and
>    blocking writes.  By default, HDFS blocks are replicated 3x: local,
>    "nearby", and "far away" to minimize the impact of data center catastrophe.
>     In addition to the obvious 3x cost for storage, the result is that every
>    data block must be written "far away" - exactly the opposite of the "Move
>    Computation to Data" mantra.  Furthermore, these over-network writes are
>    synchronous; the client write blocks until all copies are complete on disk,
>    with the longest latency path of 2 network hops plus a disk write gating the
>    overall write speed.   Note that while this would be disastrous for a
>    general-purpose filesystem, with true WORM usage it may be acceptable to
>    penalize writes this way.
>
Facebook seems to have a more cost-effective way to do replication, but I
am not sure about its MapReduce performance -- at the end of the day, only
the two machines holding a 'proper' replica can host a 'cheap' (data-local)
map task.
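For concreteness, here is a minimal sketch (my own illustration, not code from
any of the systems discussed) of what the 3x-replication point above looks
like from the client side of the HDFS Java API. The path, buffer size, and
block size are made up; the client either inherits dfs.replication from its
configuration or passes a replication factor to create(), and the write is
pipelined through all of the chosen replicas.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicatedWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        short replication = 3;                      // the default dfs.replication
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"),
            true, 4096, replication, 64L * 1024 * 1024);
        out.write("some data".getBytes());
        out.close();   // blocks until the local -> "nearby" -> "far away" pipeline has the data
      }
    }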


>
>    - Large block size implies fewer files.  HDFS reaches limits in the
>    tens of millions of files.
>    - Large block size wastes space for small files.  The minimum file size
>    is 1 block.
>    - There is no data caching.  When delivering large contiguous streaming
>    data, this doesn't matter.  But when the read load is random, seeky, or
>    partial, this is a missing high-impact performance feature.
>
Yes, can anyone comment on this point? I want to ask the same question as
well.
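On the file-count limit a couple of bullets up: a rough, commonly cited rule
of thumb is that every file, directory, and block is an object in NameNode
memory at something like 150 bytes apiece, so the real ceiling is NameNode
heap. Back-of-envelope with made-up numbers:

    50M small files, 1 block each:  ~50M file objects + ~50M block objects = ~100M objects
    ~100M objects x ~150 bytes     ~= 15 GB of NameNode heap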

>
>    - In a WORM model, changing a small part of a file requires all the
>    file data to be copied, so e.g. database record modifications would be very
>    expensive.
>
Yes, can anyone answer this question?

>
>    - There are no hardlinks, softlinks, or quotas.
>    - HDFS isn't directly mountable, and therefore requires a non-standard
>    API to use.  (FUSE workaround exists.)
>    - Java source code is very portable and easy to install, but not very
>    quick.
>    - Moving computation is cheaper than moving data.  But the data
>    nonetheless always has to be moved: either read off of a local hard drive or
>    read over the network into the compute node's memory.  It is not necessarily
>    the case that reading a local hard drive is faster than reading a
>    distributed (striped) file over a fast network.  Commodity network (e.g.
>    1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR
>    Infiniband) can deliver data significantly faster than a local commodity
>    hard drive.
>
I agree with this statement: "It is not necessarily the case that reading a
local hard drive is faster than reading a distributed (striped) file over a
fast network," probably with Infiniband as well as with a 10GigE network. And
this is why I feel it might not be a good strategy for HBase to tie its design
entirely to HDFS.
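And on the "not directly mountable, requires a non-standard API" bullet above,
a minimal hedged sketch of what even a trivial read looks like for an
application coded against the Hadoop FileSystem API (the path is
hypothetical); anything not written against this API has to go through the
FUSE workaround instead.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // fs.default.name -> hdfs://namenode:port
        FileSystem fs = FileSystem.get(conf);
        // open() streams the file from whichever DataNodes hold its blocks
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/data/input.txt"))));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }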



> If I'm missing other points, pro- or con-, I would appreciate hearing them.
>  Note again I'm not questioning the success of HDFS in achieving those
> stated design choices, but rather trying to understand HDFS's applicability
> to other storage domains beyond Hadoop.
>
> Thanks for your time.
>
>


-- 
--Sean

Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
On Mon, Jan 31, 2011 at 6:34 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> I feel this is a great discussion, so let's think of HDFS' customers.
> (1) MapReduce --- definitely a perfect fit as Nathan has pointed out
I would add the caveat that this depends on your particular weighting
factors of performance, ease of setup, hardware type, sysadmin
sophistication, failure scenarios, and total cost of ownership.  And
that cost is a non-linear function of scale. It's not true that HDFS
is always the best choice even for MapReduce.

> (2) HBase --- it seems HBase (Bigtable's log-structured file design) did a
> great job on this. The solution comes out of Google, so it must be right.
I think this attitude is a major factor in why people choose HBase
(and HDFS).  But Google also sits at a particular point on the
many-dimensional factor space I alluded to above.  Best for Google
does not mean best for everyone.

> But would
> Google necessarily have chosen this approach for its Bigtable system had
> GFS not existed in the first place? I.e., could we have an alternative 'best'
> approach?
I bet you can guess my answer :)

> Anything else? I do not think HDFS is a good file system choice for
> enterprise applications.
Certainly not for most.

> On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:
>>
>> I have a very general question on the usefulness of HDFS for purposes
>> other than running distributed compute jobs for Hadoop.  Hadoop and HDFS
>> seem very popular these days, but the use of HDFS for other purposes
>> (database backend, records archiving, etc) confuses me, since there are
>> other free distributed filesystems out there (I personally work on Lustre),
>> with significantly better general-purpose performance.
>> So please tell me if I'm wrong about any of this.  Note I've gathered most
>> of my info from documentation rather than reading the source code.
>> As I understand it, HDFS was written specifically for Hadoop compute jobs,
>> with the following design factors in mind:
>>
>> write-once-read-many (WORM) access model
>> use commodity hardware with relatively high failure rates (i.e.
>> failures are expected)
>> long, sequential streaming data access
>> large files
>> hardware/OS agnostic
>> moving computation is cheaper than moving data
>>
>> While appropriate for processing many large-input Hadoop data-processing
>> jobs, there are significant penalties to be paid when trying to use these
>> design factors for more general-purpose storage:
>>
>> Commodity hardware requires data replication for safety.  The HDFS
>> implementation has three penalties: storage redundancy, network loading, and
>> blocking writes.  By default, HDFS blocks are replicated 3x: local,
>> "nearby", and "far away" to minimize the impact of data center catastrophe.
>>  In addition to the obvious 3x cost for storage, the result is that every
>> data block must be written "far away" - exactly the opposite of the "Move
>> Computation to Data" mantra.  Furthermore, these over-network writes are
>> synchronous; the client write blocks until all copies are complete on disk,
>> with the longest latency path of 2 network hops plus a disk write gating the
>> overall write speed.   Note that while this would be disastrous for a
>> general-purpose filesystem, with true WORM usage it may be acceptable to
>> penalize writes this way.
>
> Facebook seems to have a more cost-effective way to do replication, but I am
> not sure about its MapReduce performance -- at the end of the day, only the
> two machines holding a 'proper' replica can host a 'cheap' (data-local) map
> task.
>
>>
>> Large block size implies fewer files.  HDFS reaches limits in the tens of
>> millions of files.
>> Large block size wastes space for small files.  The minimum file size is 1
>> block.
>> There is no data caching.  When delivering large contiguous streaming
>> data, this doesn't matter.  But when the read load is random, seeky, or
>> partial, this is a missing high-impact performance feature.
>
> Yes, can anyone comment on this point? I want to ask the same question as
> well.

I talked to one of the principal HDFS designers, and he agreed with me
on all these points...

>>
>> In a WORM model, changing a small part of a file requires all the file
>> data to be copied, so e.g. database record modifications would be very
>> expensive.
>
> Yes, can anyone answer this question?
>>
>> There are no hardlinks, softlinks, or quotas.
... except that one.  HDFS now does softlinks and quotas.

>> HDFS isn't directly mountable, and therefore requires a non-standard API
>> to use.  (FUSE workaround exists.)
>> Java source code is very portable and easy to install, but not very quick.
>> Moving computation is cheaper than moving data.  But the data nonetheless
>> always has to be moved: either read off of a local hard drive or read over
>> the network into the compute node's memory.  It is not necessarily the case
>> that reading a local hard drive is faster than reading a distributed
>> (striped) file over a fast network.  Commodity network (e.g. 1GigE),
>> probably yes.  But a fast (and expensive) network (e.g. 4xDDR Infiniband)
>> can deliver data significantly faster than a local commodity hard drive.
>
> I agree with this statement: "It is not necessarily the case that reading a
> local hard drive is faster than reading a distributed (striped) file over a
> fast network," probably with Infiniband as well as with a 10GigE network. And
> this is why I feel it might not be a good strategy for HBase to tie its
> design entirely to HDFS.

I've proved this to my own satisfaction with a simple TestDFSIO
benchmark on HDFS and Lustre.  I posted the results in another thread
here.

>
>>
>> If I'm missing other points, pro- or con-, I would appreciate hearing
>> them.  Note again I'm not questioning the success of HDFS in achieving those
>> stated design choices, but rather trying to understand HDFS's applicability
>> to other storage domains beyond Hadoop.
>> Thanks for your time.

Re: HDFS without Hadoop: Why?

Posted by Ted Dunning <td...@maprtech.com>.
Speak for yourself!

With 12 local disks on a dual backplane, it is possible to sustain >1GB/s
read or write.  This is with fairly straightforward hardware.

That is pretty difficult to do with commodity networking, which is generally
limited to dual 1Gb NICs for an aggregate of about 200MB/s if you are lucky.

Moreover, bandwidth to locally cached data is vastly better than bandwidth
to remotely cached data.

I agree that network latency isn't a big factor next to disk latency, but
local bandwidth still wins.
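Rough numbers behind that, assuming ~100MB/s sequential per commodity SATA
drive and best-case GigE:

    12 disks x ~100 MB/s            ~= 1.2 GB/s aggregate local read/write
    2 x 1 GbE = 2 x ~125 MB/s raw   ~= 250 MB/s theoretical, ~200 MB/s if you are lucky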

On Tue, Feb 1, 2011 at 12:26 AM, tsuna <ts...@gmail.com> wrote:

> If you have a well designed network infrastructure, using local disks
> is not noticeably faster than using a remote disk.  The performance
> penalty of network hops is negligible compared to the performance
> penalty of going to disk.
>

Re: HDFS without Hadoop: Why?

Posted by tsuna <ts...@gmail.com>.
On Mon, Jan 31, 2011 at 6:34 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:
>>    - Commodity hardware requires data replication for safety.  The HDFS
>>    implementation has three penalties: storage redundancy, network loading, and
>>    blocking writes.  By default, HDFS blocks are replicated 3x: local,
>>    "nearby", and "far away" to minimize the impact of data center catastrophe.
>>     In addition to the obvious 3x cost for storage, the result is that every
>>    data block must be written "far away" - exactly the opposite of the "Move
>>    Computation to Data" mantra.  Furthermore, these over-network writes are
>>    synchronous; the client write blocks until all copies are complete on disk,
>>    with the longest latency path of 2 network hops plus a disk write gating the
>>    overall write speed.   Note that while this would be disastrous for a
>>    general-purpose filesystem, with true WORM usage it may be acceptable to
>>    penalize writes this way.

If you have a well designed network infrastructure, using local disks
is not noticeably faster than using a remote disk.  The performance
penalty of network hops is negligible compared to the performance
penalty of going to disk.

>>    - There is no data caching.  When delivering large contiguous streaming
>>    data, this doesn't matter.  But when the read load is random, seeky, or
>>    partial, this is a missing high-impact performance feature.

The caching is provided by the OS.  The Linux page cache does a much
better job of caching and pre-fetching than a Hadoop DataNode could
ever dream of.

>>    - In a WORM model, changing a small part of a file requires all the
>>    file data to be copied, so e.g. database record modifications would be very
>>    expensive.
>>
> Yes, can anyone answer this question?

HBase doesn't modify files.  It only streams edits to a log, and
sometimes writes full files at once.
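In other words it is an append-only, log-structured write path. A minimal
sketch of the pattern (toy classes of my own, not HBase's actual code): every
edit is appended to a write-ahead log and buffered in a sorted in-memory map,
and the buffer is periodically flushed out as a brand-new immutable file, so
nothing ever has to be rewritten in place.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;
    import java.util.Map;
    import java.util.TreeMap;

    public class AppendOnlyStore {
      private final File dir;
      private final Writer wal;                        // write-ahead log, append-only
      private final TreeMap<String, String> memstore = new TreeMap<String, String>();
      private int flushes = 0;

      public AppendOnlyStore(File dir) throws IOException {
        this.dir = dir;
        this.wal = new FileWriter(new File(dir, "wal.log"), true);   // append mode
      }

      public void put(String key, String value) throws IOException {
        wal.write(key + "\t" + value + "\n");          // 1. stream the edit to the log
        wal.flush();
        memstore.put(key, value);                      // 2. buffer it, sorted, in memory
      }

      // 3. Periodically dump the whole buffer as one new immutable file.
      public void flush() throws IOException {
        Writer out = new FileWriter(new File(dir, "store-" + (flushes++) + ".dat"));
        try {
          for (Map.Entry<String, String> e : memstore.entrySet()) {
            out.write(e.getKey() + "\t" + e.getValue() + "\n");
          }
        } finally {
          out.close();
        }
        memstore.clear();                              // existing files are never modified
      }
    }

Reads merge the memstore with the immutable files, and compaction later merges
several small files into one big one, again as whole-file writes, which is
exactly the access pattern HDFS handles well.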

>>    - Moving computation is cheaper than moving data.  But the data
>>    nonetheless always has to be moved: either read off of a local hard drive or
>>    read over the network into the compute node's memory.  It is not necessarily
>>    the case that reading a local hard drive is faster than reading a
>>    distributed (striped) file over a fast network.  Commodity network (e.g.
>>    1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR
>>    Infiniband) can deliver data significantly faster than a local commodity
>>    hard drive.
>>
> I agree with this statement: "It is not necessarily the case that reading a
> local hard drive is faster than reading a distributed (striped) file over a
> fast network," probably with Infiniband as well as with a 10GigE network. And
> this is why I feel it might not be a good strategy for HBase to tie its
> design entirely to HDFS.

You don't need Infiniband or 10GigE.  At StumbleUpon we have 1GbE
line-rate between all machines, and the RTT latency across racks is
about 95µs.  It's far, far more cost-effective than 10GbE or
Infiniband.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com