Posted to hdfs-user@hadoop.apache.org by Nathan Rutman <nr...@gmail.com> on 2011/01/25 21:37:11 UTC

HDFS without Hadoop: Why?

I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very popular these days, but the use of HDFS for other purposes (database backends, records archiving, etc.) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre) with significantly better general-purpose performance.

So please tell me if I'm wrong about any of this.  Note I've gathered most of my info from documentation rather than reading the source code.

As I understand it, HDFS was written specifically for Hadoop compute jobs, with the following design factors in mind:
write-once-read-many (WORM) access model
use commodity hardware with relatively high failure rates (i.e. failures are assumed)
long, sequential streaming data access
large files
hardware/OS agnostic
moving computation is cheaper than moving data

While these design factors are appropriate for processing many large-input Hadoop data-processing jobs, there are significant penalties to be paid when trying to use them for more general-purpose storage:
Commodity hardware requires data replication for safety.  The HDFS implementation carries three penalties here: storage redundancy, network load, and blocking writes.  By default, HDFS blocks are replicated 3x: local, "nearby", and "far away", to minimize the impact of a data-center catastrophe.  In addition to the obvious 3x storage cost, the result is that every data block must be written "far away" - exactly the opposite of the "Move Computation to Data" mantra.  Furthermore, these over-network writes are synchronous; the client's write blocks until all copies are complete on disk, with the longest-latency path of two network hops plus a disk write gating the overall write speed.  Note that while this would be disastrous for a general-purpose filesystem, with true WORM usage it may be acceptable to penalize writes this way.  (A small sketch of the client API and the per-file replication knob follows this list.)
Large block size implies fewer files.  HDFS reaches limits in the tens of millions of files.
Large block size wastes space for small files.  The minimum file size is 1 block.
There is no data caching.  When delivering large contiguous streaming data, this doesn't matter.  But when the read load is random, seeky, or partial, this is a missing high-impact performance feature.
In a WORM model, changing a small part of a file requires all the file data to be copied, so e.g. database record modifications would be very expensive.
There are no hardlinks, softlinks, or quotas.
HDFS isn't directly mountable, and therefore requires a non-standard API to use.  (FUSE workaround exists.)
Java source code is very portable and easy to install, but not very quick.
Moving computation is cheaper than moving data.  But the data nonetheless always has to be moved: either read off of a local hard drive or read over the network into the compute node's memory.  It is not necessarily the case that reading a local hard drive is faster than reading a distributed (striped) file over a fast network.  Commodity network (e.g. 1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR Infiniband) can deliver data significantly faster than a local commodity hard drive. 
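
To make the client-facing side of this concrete, here is a minimal Java sketch (not from the original mail; class, path, and values are illustrative) of the HDFS client API alluded to above. It shows that replication factor and block size are per-file parameters, and that a write returns only after the replica pipeline has acknowledged the data:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteKnobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Per-file overrides of the cluster defaults (dfs.replication, dfs.block.size):
            short replication = 3;                      // each block is pipelined to 3 datanodes
            long blockSize = 64L * 1024 * 1024;         // 64 MB, the common default at the time
            Path p = new Path("/tmp/example.dat");      // illustrative path

            FSDataOutputStream out = fs.create(p, true,
                    conf.getInt("io.file.buffer.size", 4096), replication, blockSize);
            out.write(new byte[]{1, 2, 3});
            out.close();                                // returns once the replica pipeline has acknowledged the data

            // Replication can also be dialed down later for data that matters less:
            fs.setReplication(p, (short) 2);
        }
    }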

If I'm missing other points, pro- or con-, I would appreciate hearing them.  Note again I'm not questioning the success of HDFS in achieving those stated design choices, but rather trying to understand HDFS's applicability to other storage domains beyond Hadoop.

Thanks for your time.


Re: HDFS without Hadoop: Why?

Posted by Stuart Smith <st...@yahoo.com>.
> Stuart - if Dhruba is giving hdfs file and block sizes used by the
> namenode, you really cannot get a more authoritative number elsewhere :)

Yes - very true! :)

I spaced out on the name there ... ;)

One more thing - I believe that if you're storing a lot of your smaller files in hbase, you'll end up with far fewer files on hdfs, since several of your smaller files will end up in one HFile?

I'm storing 5-7 million files, with at least 70-80% ending up in hbase. I only have 16 GB of RAM for my name-node, and it's very far from overloading the memory. Off the top of my head, I think it's << 8 GB of RAM used... 


Take care,
  -stu


Re: HDFS without Hadoop: Why?

Posted by Gaurav Sharma <ga...@gmail.com>.
Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode,
you really cannot get a more authoritative number elsewhere :) I would do
the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
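
Plugging those rough figures into code (a back-of-envelope sketch only, not from the thread; the one-block-per-file assumption applies to files smaller than the block size, as in the earlier question about a max-file-count formula):

    public class NamenodeMemoryEstimate {
        // Dhruba's rough figures: ~160 bytes of namenode heap per file, ~150 per block.
        static final long BYTES_PER_FILE = 160;
        static final long BYTES_PER_BLOCK = 150;

        // Rough namenode heap needed for a given namespace size (metadata only).
        static long heapBytes(long files, long blocks) {
            return files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
        }

        // Crude ceiling on file count for a given heap, assuming one block per file.
        static long maxFiles(long heapBytes) {
            return heapBytes / (BYTES_PER_FILE + BYTES_PER_BLOCK);
        }

        public static void main(String[] args) {
            // 100 million files referencing 200 million blocks -> roughly 43 GB of metadata,
            // consistent with the "at least 60 GB of RAM" figure quoted from the Yahoo post
            // (the real requirement is higher because metadata is not the only thing on the heap).
            System.out.println(heapBytes(100000000L, 200000000L) / (1024.0 * 1024 * 1024) + " GB of metadata");
            // A 16 GB heap with one block per file gives a crude ceiling of roughly 55 million files.
            System.out.println(maxFiles(16L * 1024 * 1024 * 1024) + " files");
        }
    }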


Re: HDFS without Hadoop: Why?

Posted by Konstantin Shvachko <sh...@gmail.com>.
Nathan,

Great references. There is a good place to put them:
http://wiki.apache.org/hadoop/HDFS_Publications
GPFS and Lustre papers are not there yet, I believe.

Thanks,
--Konstantin


Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
On Feb 2, 2011, at 6:42 PM, Konstantin Shvachko wrote:

> Thanks for the link Stu.
> More details are on limitations are here:
> http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
> 
> I think that Nathan raised an interesting question and his assessment of HDFS use 
> cases are generally right.
> Some assumptions though are outdated at this point. 
> And people mentioned about it in the thread.
> We have append implementation, which allows reopening files for updates.
> We also have symbolic links and quotas (space and name-space).
> The api to HDFS is not posix, true. But in addition to Fuse people also use 
> Thrift to access hdfs.
> Most of these features are explained in HDFS overview paper:
> http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
> 
> Stand-alone HDFS is actually used in several places. I like what
> Brian Bockelman at University of Nebraska does.
> They store CERN data in their cluster, and physicists use Fortran to access the data,
> not map-reduce, as I heard.
> http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf
This doesn't seem to mention what storage they're using.
> 
> With respect to other distributed file systems. HDFS performance was compared to
> PVFS, GPFS and Lustre. The results were in favor of HDFS. See e.g.
PVFS
> http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf
> 

Some other references for those interested:  HDFS vs
GPFS: "Cloud analytics: Do we really need to reinvent the storage stack?"
      http://www.usenix.org/event/hotcloud09/tech/full_papers/ananthanarayanan.pdf
Lustre: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
Ceph: http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf

These GPFS and Lustre papers were both favorable toward HDFS because
they missed a fundamental issue: for the former FS's, network speed is critical.
HDFS doesn't need network on reads (ideally), and so is simultaneously immune to network
speed, but also cannot take advantage of network speed.  For slow networks (1GigE)
this plays into HDFS's strength, but for fast networks (10GigE, Infiniband),
the balance tips the other way. (My testing: for a heavily loaded network, a 3-4x read 
speed factor for Lustre.  For writes, the difference is even more extreme (10x), 
since HDFS has to hop all write data over the network twice.)

Let me say clearly that your choice of FS should depend on which of many factors
are most important to you -- there is no "one size fits all", although that sadly makes our
decisions more complex.  For those using Hadoop that have a high weighting on
IO performance (as well as some other factors I listed in my original mail), I suggest you 
at least think about spending money on a fast network and using a FS that can utilize it.


> So I agree with Nathan HDFS was designed and optimized as a storage layer for 
> map-reduce type tasks, but it performs well as a general purpose fs as well.
> 
> Thanks,
> --Konstantin


Re: HDFS without Hadoop: Why?

Posted by Konstantin Shvachko <sh...@gmail.com>.
Thanks for the link Stu.
More details on the limitations are here:
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

I think that Nathan raised an interesting question and his assessment of
HDFS use cases is generally right. Some assumptions, though, are outdated
at this point, and people have mentioned this in the thread.
We have an append implementation, which allows reopening files for updates.
We also have symbolic links and quotas (space and name-space).
The API to HDFS is not POSIX, true, but in addition to FUSE people also use
Thrift to access HDFS.
Most of these features are explained in the HDFS overview paper:
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
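
As a small illustration of two of those features (a sketch, not from the thread, assuming a release where FileContext symlinks are available; paths and quota values are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class SymlinkAndQuotaExample {
        public static void main(String[] args) throws Exception {
            // Symbolic links go through the FileContext API rather than the older FileSystem API.
            FileContext fc = FileContext.getFileContext(new Configuration());
            fc.createSymlink(new Path("/data/releases/2011-02-02"),  // target (illustrative)
                             new Path("/data/releases/current"),     // link
                             true);                                  // create missing parents

            // Name-space and space quotas are set per directory from the admin shell, e.g.:
            //   hadoop dfsadmin -setQuota 1000000 /data/releases
            //   hadoop dfsadmin -setSpaceQuota 10995116277760 /data/releases   (10 TB)
        }
    }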

Stand-alone HDFS is actually used in several places. I like what
Brian Bockelman at the University of Nebraska does.
They store CERN data in their cluster, and physicists use Fortran to access
the data, not map-reduce, as I heard.
http://storageconference.org/2010/Presentations/MSST/3.Bockelman.pdf

With respect to other distributed file systems: HDFS performance was compared
to PVFS, GPFS, and Lustre, and the results were in favor of HDFS. See e.g.
http://www.cs.cmu.edu/~wtantisi/files/hadooppvfs-pdl08.pdf

So I agree with Nathan that HDFS was designed and optimized as a storage layer
for map-reduce type tasks, but it performs well as a general-purpose fs as well.

Thanks,
--Konstantin





Re: HDFS without Hadoop: Why?

Posted by Stuart Smith <st...@yahoo.com>.
This is the best coverage I've seen from a source that would know:

http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc...

Take care,
  -stu


Re: HDFS without Hadoop: Why?

Posted by Dhruba Borthakur <dh...@gmail.com>.
The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is
a very rough calculation.

dhruba




-- 
Connect to me at http://www.facebook.com/dhruba

RE: HDFS without Hadoop: Why?

Posted by "Dhodapkar, Chinmay" <ch...@qualcomm.com>.
What you describe is pretty much my use case as well. Since I don’t know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in hdfs.

From what I have read, the name node will store all metadata of all files in RAM. Assuming (in my case) that a file is smaller than the configured block size, there should be a very rough formula for calculating the max number of files that hdfs can serve based on the configured RAM on the name node?

Can any of the implementers comment on this? Am I even thinking on the right track?

Thanks Ian for the haystack link…very informative indeed.

-Chinmay






Re: HDFS without Hadoop: Why?

Posted by Ian Holsman <ha...@holsman.net>.
Haystack is described here
http://www.facebook.com/note.php?note_id=76191543919

Regards
Ian


--- 
Ian Holsman
AOL Inc
Ian.Holsman@teamaol.com
(703) 879-3128 / AIM:ianholsman 

it's just a technicality


RE: HDFS without Hadoop: Why?

Posted by Stuart Smith <st...@yahoo.com>.
Hello,
   I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :) 

Facebook probably knows better, but what I do is:

  - store metadata in hbase
  - files smaller than 10 MB or so in hbase
  - larger files in an hdfs directory tree.

I started storing 64 MB files and smaller in hbase (chunk size), but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster & my region servers don't have that much memory. I would play with the size to see what works for you. (A rough sketch of this scheme follows.)
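
(The sketch below is not from the thread; it assumes a hypothetical HBase table "files" with "meta" and "content" column families and the 0.90-era client API, and the 10 MB threshold is the rule of thumb above:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        static final long SMALL_FILE_LIMIT = 10L * 1024 * 1024;   // ~10 MB rule of thumb

        // Store file bytes: small files go into an HBase cell, large files into an HDFS path.
        public static void store(Configuration conf, String name, byte[] data) throws Exception {
            HTable files = new HTable(conf, "files");              // hypothetical table: row key = file name
            Put row = new Put(Bytes.toBytes(name));
            row.add(Bytes.toBytes("meta"), Bytes.toBytes("size"), Bytes.toBytes(Long.toString(data.length)));

            if (data.length < SMALL_FILE_LIMIT) {
                // Small file: keep the content inline in HBase; many such cells share one HFile,
                // so the namenode never sees them as individual files.
                row.add(Bytes.toBytes("content"), Bytes.toBytes("data"), data);
            } else {
                // Large file: write it into an HDFS directory tree and record only its location.
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path("/archive/" + name);          // hypothetical layout
                FSDataOutputStream out = fs.create(path);
                out.write(data);
                out.close();
                row.add(Bytes.toBytes("content"), Bytes.toBytes("hdfsPath"), Bytes.toBytes(path.toString()));
            }
            files.put(row);
            files.close();
        }
    }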

Take care, 
   -stu


RE: HDFS without Hadoop: Why?

Posted by "Dhodapkar, Chinmay" <ch...@qualcomm.com>.
Hello,

I have been following this thread for some time now. I am very comfortable with the advantages of hdfs, but still have lingering questions about the usage of hdfs for general purpose storage (no mapreduce/hbase etc).

Can somebody shed light on what the limitations are on the number of files that can be stored? Is it limited in any way by the namenode? The use case I am interested in is to store a very large number of relatively small files (1MB to 25MB).

Interestingly, I saw a facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc in something called "haystack" (why not use hdfs since they already have it?). Anybody know what "haystack" is?

Thanks!
Chinmay




Re: HDFS without Hadoop: Why?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
>    - Large block size wastes space for small files.  The minimum file size
>    is 1 block.

That's incorrect. If a file is smaller than the block size, it will only
consume as much space as there is data in the file.

>    - There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.

Re: HDFS without Hadoop: Why?

Posted by Gerrit Jansen van Vuuren <ge...@googlemail.com>.
The smallest file size in HDFS is not the block size. The block size is an upper
limit; if you store smaller files, they will not take up extra space.
HDFS is not meant for fast random access but is built specifically for large
files and sequential access.
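
A quick way to see this (a sketch, not from the thread; the path is made up) is to write a file well under the block size and compare its length, its block size, and the space it actually consumes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileFootprint {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/one-megabyte.dat");            // illustrative path

            FSDataOutputStream out = fs.create(p);
            out.write(new byte[1024 * 1024]);                      // 1 MB of data
            out.close();

            FileStatus st = fs.getFileStatus(p);
            ContentSummary cs = fs.getContentSummary(p);
            System.out.println("length:         " + st.getLen());           // ~1 MB
            System.out.println("block size:     " + st.getBlockSize());     // e.g. 64 MB: an upper limit, not an allocation unit
            System.out.println("space consumed: " + cs.getSpaceConsumed()); // ~1 MB x replication, not 64 MB x replication
        }
    }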


On Wed, Jan 26, 2011 at 9:59 AM, Gerrit Jansen van Vuuren <gerritjvv@googlemail.com> wrote:

> Hi,
>
> For true data durability RAID is not enough.
> The conditions I operate on are the following:
>
> (1) Data loss is not acceptable under any terms
> (2) Data unavailability is not acceptable under any terms for any period of
> time.
> (3) Data loss for certain data sets becomes a legal issue and is again not
> acceptable, and might lead to loss of my employment.
> (4) Having 2 nodes fail in a month on average for the volumes we operate
> is to be expected, i.e. 100 to 400 nodes per cluster.
> (5) Having a data centre outage once a year is to be expected. (We've
> already had one this year)
>
> A word on node failure: Nodes do not just fail because of disks, any
> component can fail e.g. RAM, NetworkCard, SCSI controller, CPU etc.
>
> Now data loss or unavailability can happen under the following conditions:
> (1) Multiple of single disk failure
> (2) Node failure (a whole U goes down)
> (3) Rack failure
> (4) Data Centre failure
>
> Raid covers (1) but I do not know of any raid setup that will cover the
> rest.
> HDFS with 3 way replication covers 1,2, and 3 but not 4.
> HDFS 3 way replication with replication (via distcp) across data centres
> covers 1-4.
>
> The question to ask the business is how valuable is the data in question to
> them? If they go RAID and only cover (1), they should be asked if it's
> acceptable to have data unavailable with the possibility of permanent data
> loss at any point of time for any amount of data for any amount of time.
> If they come back to you and say yes, we accept that if a node fails we
> lose data or that it becomes unavailable for any period of time, then yes,
> go for RAID. If the answer is NO, you need replication; even DBAs understand
> this and that's why for DBs we back up, replicate and load/fail-over balance.
> Why should we not do the same for critical business data on file storage?
>
>
> We run all of our nodes non raided (JBOD), because having 3 replicas means
> you don't require extra replicas on the same disk or node.
>
> Yes, it's true that any distributed file system will make data available to
> any number of nodes, but this was not my point earlier. Having data replicas
> on multiple nodes means that data can be worked on in parallel on multiple
> physical nodes without having to read/copy the data from a single node.
>
> Cheers,
>  Gerrit
>
>
> On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dh...@gmail.com>wrote:
>
>> Hi Nathan,
>>
>> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a
>> replication factor of 2.2 and a few datasets have a replication factor of
>> 1.4.  Some details here:
>>
>> http://wiki.apache.org/hadoop/HDFS-RAID
>>
>> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
>>
>> thanks,
>> dhruba
>>
>>
>> On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:
>>
>>>> My point was it's not RAID or whatever versus HDFS. HDFS is a distributed
>>>> file system that solves different problems.
>>>
>>>
>>>  HDFS is a file system. It's like asking NTFS or RAID?
>>>
>>> >but can be generally dealt with using hardware and software failover
>>> techniques.
>>>
>>> Like hdfs.
>>>
>>> Best,
>>>  -stu
>>> -----Original Message-----
>>> From: Nathan Rutman <nr...@gmail.com>
>>> Date: Tue, 25 Jan 2011 17:31:25
>>> To: <hd...@hadoop.apache.org>
>>> Reply-To: hdfs-user@hadoop.apache.org
>>> Subject: Re: HDFS without Hadoop: Why?
>>>
>>>
>>> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
>>>
>>> > I don't think, as a recovery strategy, RAID scales to large amounts of
>>> data. Even as some kind of attached storage device (e.g. Vtrack), you're
>>> only talking about a few terabytes of data, and it doesn't tolerate node
>>> failure.
>>>
>>> When talking about large amounts of data, 3x redundancy absolutely
>>> doesn't scale.  Nobody is going to pay for 3 petabytes worth of disk if they
>>> only need 1 PB worth of data.  This is where dedicated high-end raid systems
>>> come in (this is in fact what my company, Xyratex, builds).  Redundant
>>> controllers, battery backup, etc.  The incremental cost for an additional
>>> drive in such systems is negligible.
>>>
>>> >
>>> > A key part of hdfs is the distributed part.
>>>
>>> Granted, single-point-of-failure arguments are valid when concentrating
>>> all the storage together, but can be generally dealt with using hardware and
>>> software failover techniques.
>>>
>>> The scale argument in my mind is exactly reversed -- HDFS works fine for
>>> smaller installations that can't afford RAID hardware overhead and access
>>> redundancy, and where buying 30 drives instead of 10 is an acceptable cost
>>> for the simplicity of HDFS setup.
>>>
>>> >
>>> > Best,
>>> > -stu
>>> > -----Original Message-----
>>> > From: Nathan Rutman <nr...@gmail.com>
>>> > Date: Tue, 25 Jan 2011 16:32:07
>>> > To: <hd...@hadoop.apache.org>
>>> > Reply-To: hdfs-user@hadoop.apache.org
>>> > Subject: Re: HDFS without Hadoop: Why?
>>> >
>>> >
>>> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Why would 3x data seem wasteful?
>>> >> This is exactly what you want.  I would never store any serious
>>> business data without some form of replication.
>>> >
>>> > I agree that you want data backup, but 3x replication is the least
>>> efficient / most expensive (space-wise) way to do it.  This is what RAID was
>>> invented for: RAID 6 gives you fault tolerance against loss of any two
>>> drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note
>>> this in my original email, but that's what I had in mind.) RAID is also not
>>> necessarily $ expensive either; Linux MD RAID is free and effective.
>>> >
>>> >> What happens if you store a single file on a single server without
>>> replicas and that server goes, or just the disk on that the file is on goes
>>> ? HDFS and any decent distributed file system uses replication to prevent
>>> data loss. As a side affect having the same replica of a data piece on
>>> separate servers means that more than one task can work on the server in
>>> parallel.
>>> >
>>> > Indeed, replicated data does mean Hadoop could work on the same block
>>> on separate nodes.  But outside of Hadoop compute jobs, I don't think this
>>> is useful in general.  And in any case, a distributed filesystem would let
>>> you work on the same block of data from however many nodes you wanted.
>>> >
>>> >
>>>
>>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>
>

Re: HDFS without Hadoop: Why?

Posted by st...@yahoo.com.
I believe for most people, the answer is "Yes"

-----Original Message-----
From: Nathan Rutman <nr...@gmail.com>
Date: Wed, 26 Jan 2011 09:41:37 
To: <hd...@hadoop.apache.org>
Reply-To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?

Ok.  Is your statement, "I use HDFS for general-purpose data storage because it does this replication well", or is it more, "the most important benefit of using HDFS as the Map-Reduce or HBase backend fs is data safety."  In other words, I'd like to relate this back to my original question of the broader usage of HDFS - does it make sense to use HDFS outside of the special application space for which it was designed?


On Jan 26, 2011, at 1:59 AM, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> For true data durability RAID is not enough.
> The conditions I operate on are the following:
> 
> (1) Data loss is not acceptable under any terms
> (2) Data unavailability is not acceptable under any terms for any period of time.
> (3) Data loss for certain data sets become a legal issue and is again not acceptable, an might lead to loss of my employment.
> (4) Having 2 nodes fail in a month on average under for volumes we operate is to be expected, i.e. 100 to 400 nodes per cluster.
> (5) Having a data centre outage once a year is to be expected. (We've already had one this year)
> 
> A word on node failure: Nodes do not just fail because of disks, any component can fail e.g. RAM, NetworkCard, SCSI controller, CPU etc.
> 
> Now data loss or unavailability can happen under the following conditions:
> (1) Multiple of single disk failure
> (2) Node failure (a whole U goes down)
> (3) Rack failure
> (4) Data Centre failure
> 
> Raid covers (1) but I do not know of any raid setup that will cover the rest. 
> HDFS with 3 way replication covers 1,2, and 3 but not 4.
> HDFS 3 way replication with replication (via distcp) across data centres covers 1-4.
> 
> The question to ask business is how valuable is the data in question to them? If they go RAID and only cover (1), they should be asked if its acceptable to have data unavailable with the possibility of permanent data loss at any point of time for any amount of data for any amount of time.
> If they come back to you and say yes we accept that if a node fails we loose data or that it becomes unavailable for any period of time, then yes go for RAID. If the answer is NO, you need replication, even DBAs understand this and thats why for DBs we backup, replicate and load/fail-over balance, why should we not do them same for critical business data on file storage?
> 
> 
> We run all of our nodes non raided (JBOD), because having 3 replicas means you don't require extra replicas on the same disk or node.
> 
> Yes its true that any distributed file system will make data available to any number of nodes but this was not my point earlier. Having data replicas on multiple nodes means that data can be worked from in parallel on multiple physical nodes without requiring to read/copy the data from a single node.
> 
> Cheers,
>  Gerrit
> 
> 
> On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi Nathan,
> 
> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a replication factor of 2.2 and a few datasets have a replication factor of 1.4.  Some details here:
> 
> http://wiki.apache.org/hadoop/HDFS-RAID
> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
> 
> thanks,
> dhruba
> 
> 
> On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:
> My point was it's not RAID or whatr versus HDFS. HDFS is a distributed file system that solves different problems.
> 
> 
>  HDFS is a file system. It's like asking NTFS or RAID?
> 
> >but can be generally dealt with using hardware and software failover techniques.
> 
> Like hdfs.
> 
> Best,
>  -stu
> -----Original Message-----
> From: Nathan Rutman <nr...@gmail.com>
> Date: Tue, 25 Jan 2011 17:31:25
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> 
> 
> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
> 
> > I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.
> 
> When talking about large amounts of data, 3x redundancy absolutely doesn't scale.  Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data.  This is where dedicated high-end raid systems come in (this is in fact what my company, Xyratex, builds).  Redundant controllers, battery backup, etc.  The incremental cost for an additional drive in such systems is negligible.
> 
> >
> > A key part of hdfs is the distributed part.
> 
> Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.
> 
> The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.
> 
> >
> > Best,
> > -stu
> > -----Original Message-----
> > From: Nathan Rutman <nr...@gmail.com>
> > Date: Tue, 25 Jan 2011 16:32:07
> > To: <hd...@hadoop.apache.org>
> > Reply-To: hdfs-user@hadoop.apache.org
> > Subject: Re: HDFS without Hadoop: Why?
> >
> >
> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> >
> >> Hi,
> >>
> >> Why would 3x data seem wasteful?
> >> This is exactly what you want.  I would never store any serious business data without some form of replication.
> >
> > I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.
> >
> >> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.
> >
> > Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
> >
> >
> 
> 
> 
> 
> -- 
> Connect to me at http://www.facebook.com/dhruba
> 



Re: HDFS without Hadoop: Why?

Posted by Gerrit Jansen van Vuuren <ge...@googlemail.com>.
For me it depends on the requirements for the data, but if I'm responsible
for it and the data is deemed critical, my choice would be HDFS.

One more example:
Currently where I work we are thinking of backing up data for at least 5-7
years, and suddenly all sorts of issues come into play, like disk bit rot
with offline backups. Tape? Not a choice where I'm working. We're evaluating
big storage boxes (48 x 2TB disks each, with dual-core CPUs) in an HDFS
setup (no MapReduce) just to store the data. HDFS will take care of disks
failing, and because block checksums are re-verified on a roughly three-week
cycle, the bit-rot issue is addressed as well.
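
For reference, that scan interval is configurable; a minimal hdfs-site.xml
sketch, assuming I have the 0.20-era property name right (504 hours, i.e.
three weeks, is the default):

  <property>
    <name>dfs.datanode.scan.period.hours</name>
    <value>504</value>
  </property>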



On Thu, Jan 27, 2011 at 3:04 AM, <st...@yahoo.com> wrote:

> I believe for most people, the answer is "Yes"
> ------------------------------
> From: Nathan Rutman <nr...@gmail.com>
> Date: Wed, 26 Jan 2011 09:41:37 -0800
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
>
> Ok.  Is your statement, "I use HDFS for general-purpose data storage
> because it does this replication well", or is it more, "the most important
> benefit of using HDFS as the Map-Reduce or HBase backend fs is data safety."
>  In other words, I'd like to relate this back to my original question of the
> broader usage of HDFS - does it make sense to use HDFS outside of the
> special application space for which it was designed?
>
>
> On Jan 26, 2011, at 1:59 AM, Gerrit Jansen van Vuuren wrote:
>
> Hi,
>
> For true data durability RAID is not enough.
> The conditions I operate on are the following:
>
> (1) Data loss is not acceptable under any terms
> (2) Data unavailability is not acceptable under any terms for any period of
> time.
> (3) Data loss for certain data sets become a legal issue and is again not
> acceptable, an might lead to loss of my employment.
> (4) Having 2 nodes fail in a month on average under for volumes we operate
> is to be expected, i.e. 100 to 400 nodes per cluster.
> (5) Having a data centre outage once a year is to be expected. (We've
> already had one this year)
>
> A word on node failure: Nodes do not just fail because of disks, any
> component can fail e.g. RAM, NetworkCard, SCSI controller, CPU etc.
>
> Now data loss or unavailability can happen under the following conditions:
> (1) Multiple of single disk failure
> (2) Node failure (a whole U goes down)
> (3) Rack failure
> (4) Data Centre failure
>
> Raid covers (1) but I do not know of any raid setup that will cover the
> rest.
> HDFS with 3 way replication covers 1,2, and 3 but not 4.
> HDFS 3 way replication with replication (via distcp) across data centres
> covers 1-4.
>
> The question to ask business is how valuable is the data in question to
> them? If they go RAID and only cover (1), they should be asked if its
> acceptable to have data unavailable with the possibility of permanent data
> loss at any point of time for any amount of data for any amount of time.
> If they come back to you and say yes we accept that if a node fails we
> loose data or that it becomes unavailable for any period of time, then yes
> go for RAID. If the answer is NO, you need replication, even DBAs understand
> this and thats why for DBs we backup, replicate and load/fail-over balance,
> why should we not do them same for critical business data on file storage?
>
>
> We run all of our nodes non raided (JBOD), because having 3 replicas means
> you don't require extra replicas on the same disk or node.
>
> Yes its true that any distributed file system will make data available to
> any number of nodes but this was not my point earlier. Having data replicas
> on multiple nodes means that data can be worked from in parallel on multiple
> physical nodes without requiring to read/copy the data from a single node.
>
> Cheers,
>  Gerrit
>
>
> On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dh...@gmail.com>wrote:
>
>> Hi Nathan,
>>
>> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a
>> replication factor of 2.2 and a few datasets have a replication factor of
>> 1.4.  Some details here:
>>
>> http://wiki.apache.org/hadoop/HDFS-RAID
>>
>> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
>>
>> thanks,
>> dhruba
>>
>>
>> On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:
>>
>>> My point was it's not RAID or whatr versus HDFS. HDFS is a distributed
>>> file system that solves different problems.
>>>
>>>
>>>  HDFS is a file system. It's like asking NTFS or RAID?
>>>
>>> >but can be generally dealt with using hardware and software failover
>>> techniques.
>>>
>>> Like hdfs.
>>>
>>> Best,
>>>  -stu
>>> -----Original Message-----
>>> From: Nathan Rutman <nr...@gmail.com>
>>> Date: Tue, 25 Jan 2011 17:31:25
>>> To: <hd...@hadoop.apache.org>
>>> Reply-To: hdfs-user@hadoop.apache.org
>>> Subject: Re: HDFS without Hadoop: Why?
>>>
>>>
>>> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
>>>
>>> > I don't think, as a recovery strategy, RAID scales to large amounts of
>>> data. Even as some kind of attached storage device (e.g. Vtrack), you're
>>> only talking about a few terabytes of data, and it doesn't tolerate node
>>> failure.
>>>
>>> When talking about large amounts of data, 3x redundancy absolutely
>>> doesn't scale.  Nobody is going to pay for 3 petabytes worth of disk if they
>>> only need 1 PB worth of data.  This is where dedicated high-end raid systems
>>> come in (this is in fact what my company, Xyratex, builds).  Redundant
>>> controllers, battery backup, etc.  The incremental cost for an additional
>>> drive in such systems is negligible.
>>>
>>> >
>>> > A key part of hdfs is the distributed part.
>>>
>>> Granted, single-point-of-failure arguments are valid when concentrating
>>> all the storage together, but can be generally dealt with using hardware and
>>> software failover techniques.
>>>
>>> The scale argument in my mind is exactly reversed -- HDFS works fine for
>>> smaller installations that can't afford RAID hardware overhead and access
>>> redundancy, and where buying 30 drives instead of 10 is an acceptable cost
>>> for the simplicity of HDFS setup.
>>>
>>> >
>>> > Best,
>>> > -stu
>>> > -----Original Message-----
>>> > From: Nathan Rutman <nr...@gmail.com>
>>> > Date: Tue, 25 Jan 2011 16:32:07
>>> > To: <hd...@hadoop.apache.org>
>>> > Reply-To: hdfs-user@hadoop.apache.org
>>> > Subject: Re: HDFS without Hadoop: Why?
>>> >
>>> >
>>> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Why would 3x data seem wasteful?
>>> >> This is exactly what you want.  I would never store any serious
>>> business data without some form of replication.
>>> >
>>> > I agree that you want data backup, but 3x replication is the least
>>> efficient / most expensive (space-wise) way to do it.  This is what RAID was
>>> invented for: RAID 6 gives you fault tolerance against loss of any two
>>> drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note
>>> this in my original email, but that's what I had in mind.) RAID is also not
>>> necessarily $ expensive either; Linux MD RAID is free and effective.
>>> >
>>> >> What happens if you store a single file on a single server without
>>> replicas and that server goes, or just the disk on that the file is on goes
>>> ? HDFS and any decent distributed file system uses replication to prevent
>>> data loss. As a side affect having the same replica of a data piece on
>>> separate servers means that more than one task can work on the server in
>>> parallel.
>>> >
>>> > Indeed, replicated data does mean Hadoop could work on the same block
>>> on separate nodes.  But outside of Hadoop compute jobs, I don't think this
>>> is useful in general.  And in any case, a distributed filesystem would let
>>> you work on the same block of data from however many nodes you wanted.
>>> >
>>> >
>>>
>>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>
>
>

Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
Ok.  Is your statement, "I use HDFS for general-purpose data storage because it does this replication well", or is it more, "the most important benefit of using HDFS as the Map-Reduce or HBase backend fs is data safety."  In other words, I'd like to relate this back to my original question of the broader usage of HDFS - does it make sense to use HDFS outside of the special application space for which it was designed?


On Jan 26, 2011, at 1:59 AM, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> For true data durability RAID is not enough.
> The conditions I operate on are the following:
> 
> (1) Data loss is not acceptable under any terms
> (2) Data unavailability is not acceptable under any terms for any period of time.
> (3) Data loss for certain data sets become a legal issue and is again not acceptable, an might lead to loss of my employment.
> (4) Having 2 nodes fail in a month on average under for volumes we operate is to be expected, i.e. 100 to 400 nodes per cluster.
> (5) Having a data centre outage once a year is to be expected. (We've already had one this year)
> 
> A word on node failure: Nodes do not just fail because of disks, any component can fail e.g. RAM, NetworkCard, SCSI controller, CPU etc.
> 
> Now data loss or unavailability can happen under the following conditions:
> (1) Multiple of single disk failure
> (2) Node failure (a whole U goes down)
> (3) Rack failure
> (4) Data Centre failure
> 
> Raid covers (1) but I do not know of any raid setup that will cover the rest. 
> HDFS with 3 way replication covers 1,2, and 3 but not 4.
> HDFS 3 way replication with replication (via distcp) across data centres covers 1-4.
> 
> The question to ask business is how valuable is the data in question to them? If they go RAID and only cover (1), they should be asked if its acceptable to have data unavailable with the possibility of permanent data loss at any point of time for any amount of data for any amount of time.
> If they come back to you and say yes we accept that if a node fails we loose data or that it becomes unavailable for any period of time, then yes go for RAID. If the answer is NO, you need replication, even DBAs understand this and thats why for DBs we backup, replicate and load/fail-over balance, why should we not do them same for critical business data on file storage?
> 
> 
> We run all of our nodes non raided (JBOD), because having 3 replicas means you don't require extra replicas on the same disk or node.
> 
> Yes its true that any distributed file system will make data available to any number of nodes but this was not my point earlier. Having data replicas on multiple nodes means that data can be worked from in parallel on multiple physical nodes without requiring to read/copy the data from a single node.
> 
> Cheers,
>  Gerrit
> 
> 
> On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi Nathan,
> 
> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a replication factor of 2.2 and a few datasets have a replication factor of 1.4.  Some details here:
> 
> http://wiki.apache.org/hadoop/HDFS-RAID
> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
> 
> thanks,
> dhruba
> 
> 
> On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:
> My point was it's not RAID or whatr versus HDFS. HDFS is a distributed file system that solves different problems.
> 
> 
>  HDFS is a file system. It's like asking NTFS or RAID?
> 
> >but can be generally dealt with using hardware and software failover techniques.
> 
> Like hdfs.
> 
> Best,
>  -stu
> -----Original Message-----
> From: Nathan Rutman <nr...@gmail.com>
> Date: Tue, 25 Jan 2011 17:31:25
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> 
> 
> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
> 
> > I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.
> 
> When talking about large amounts of data, 3x redundancy absolutely doesn't scale.  Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data.  This is where dedicated high-end raid systems come in (this is in fact what my company, Xyratex, builds).  Redundant controllers, battery backup, etc.  The incremental cost for an additional drive in such systems is negligible.
> 
> >
> > A key part of hdfs is the distributed part.
> 
> Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.
> 
> The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.
> 
> >
> > Best,
> > -stu
> > -----Original Message-----
> > From: Nathan Rutman <nr...@gmail.com>
> > Date: Tue, 25 Jan 2011 16:32:07
> > To: <hd...@hadoop.apache.org>
> > Reply-To: hdfs-user@hadoop.apache.org
> > Subject: Re: HDFS without Hadoop: Why?
> >
> >
> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> >
> >> Hi,
> >>
> >> Why would 3x data seem wasteful?
> >> This is exactly what you want.  I would never store any serious business data without some form of replication.
> >
> > I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.
> >
> >> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.
> >
> > Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
> >
> >
> 
> 
> 
> 
> -- 
> Connect to me at http://www.facebook.com/dhruba
> 


Re: HDFS without Hadoop: Why?

Posted by Gerrit Jansen van Vuuren <ge...@googlemail.com>.
Hi,

For true data durability RAID is not enough.
The conditions I operate on are the following:

(1) Data loss is not acceptable under any terms
(2) Data unavailability is not acceptable under any terms for any period of
time.
(3) Data loss for certain data sets becomes a legal issue and is again not
acceptable, and might lead to the loss of my employment.
(4) Having 2 nodes fail in a month, on average, is to be expected at the
volumes we operate, i.e. 100 to 400 nodes per cluster.
(5) Having a data centre outage once a year is to be expected. (We've
already had one this year)

A word on node failure: nodes do not just fail because of disks; any
component can fail, e.g. RAM, network card, SCSI controller, CPU, etc.

Now data loss or unavailability can happen under the following conditions:
(1) Single or multiple disk failures
(2) Node failure (a whole U goes down)
(3) Rack failure
(4) Data Centre failure

RAID covers (1), but I do not know of any RAID setup that will cover the
rest.
HDFS with 3-way replication covers (1), (2) and (3), but not (4).
HDFS 3-way replication, combined with replication across data centres (via
distcp -- a rough sketch follows below), covers 1-4.
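
As a rough sketch of the cross-data-centre step (the namenode hostnames and
the path here are placeholders, not our real ones):

  hadoop distcp hdfs://nn-dc1:8020/data/critical hdfs://nn-dc2:8020/data/critical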

The question to ask the business is: how valuable is the data in question
to them? If they go with RAID and only cover (1), they should be asked
whether it's acceptable to have data unavailable, with the possibility of
permanent data loss, at any point in time, for any amount of data, for any
length of time.
If they come back to you and say yes, we accept that if a node fails we lose
data or it becomes unavailable for some period of time, then yes, go for
RAID. If the answer is NO, you need replication. Even DBAs understand this,
and that's why for databases we back up, replicate and load/fail-over
balance, so why should we not do the same for critical business data on file
storage?


We run all of our nodes non-RAIDed (JBOD), because having 3 replicas means
you don't require extra copies on the same disk or node.

Yes, it's true that any distributed file system will make data available to
any number of nodes, but that was not my point earlier. Having data replicas
on multiple nodes means that the data can be worked on in parallel from
multiple physical nodes without having to read/copy it from a single node.

Cheers,
 Gerrit


On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dh...@gmail.com> wrote:

> Hi Nathan,
>
> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a
> replication factor of 2.2 and a few datasets have a replication factor of
> 1.4.  Some details here:
>
> http://wiki.apache.org/hadoop/HDFS-RAID
>
> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
>
> thanks,
> dhruba
>
>
> On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:
>
>> My point was it's not RAID or whatr versus HDFS. HDFS is a distributed
>> file system that solves different problems.
>>
>>
>>  HDFS is a file system. It's like asking NTFS or RAID?
>>
>> >but can be generally dealt with using hardware and software failover
>> techniques.
>>
>> Like hdfs.
>>
>> Best,
>>  -stu
>> -----Original Message-----
>> From: Nathan Rutman <nr...@gmail.com>
>> Date: Tue, 25 Jan 2011 17:31:25
>> To: <hd...@hadoop.apache.org>
>> Reply-To: hdfs-user@hadoop.apache.org
>> Subject: Re: HDFS without Hadoop: Why?
>>
>>
>> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
>>
>> > I don't think, as a recovery strategy, RAID scales to large amounts of
>> data. Even as some kind of attached storage device (e.g. Vtrack), you're
>> only talking about a few terabytes of data, and it doesn't tolerate node
>> failure.
>>
>> When talking about large amounts of data, 3x redundancy absolutely doesn't
>> scale.  Nobody is going to pay for 3 petabytes worth of disk if they only
>> need 1 PB worth of data.  This is where dedicated high-end raid systems come
>> in (this is in fact what my company, Xyratex, builds).  Redundant
>> controllers, battery backup, etc.  The incremental cost for an additional
>> drive in such systems is negligible.
>>
>> >
>> > A key part of hdfs is the distributed part.
>>
>> Granted, single-point-of-failure arguments are valid when concentrating
>> all the storage together, but can be generally dealt with using hardware and
>> software failover techniques.
>>
>> The scale argument in my mind is exactly reversed -- HDFS works fine for
>> smaller installations that can't afford RAID hardware overhead and access
>> redundancy, and where buying 30 drives instead of 10 is an acceptable cost
>> for the simplicity of HDFS setup.
>>
>> >
>> > Best,
>> > -stu
>> > -----Original Message-----
>> > From: Nathan Rutman <nr...@gmail.com>
>> > Date: Tue, 25 Jan 2011 16:32:07
>> > To: <hd...@hadoop.apache.org>
>> > Reply-To: hdfs-user@hadoop.apache.org
>> > Subject: Re: HDFS without Hadoop: Why?
>> >
>> >
>> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
>> >
>> >> Hi,
>> >>
>> >> Why would 3x data seem wasteful?
>> >> This is exactly what you want.  I would never store any serious
>> business data without some form of replication.
>> >
>> > I agree that you want data backup, but 3x replication is the least
>> efficient / most expensive (space-wise) way to do it.  This is what RAID was
>> invented for: RAID 6 gives you fault tolerance against loss of any two
>> drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note
>> this in my original email, but that's what I had in mind.) RAID is also not
>> necessarily $ expensive either; Linux MD RAID is free and effective.
>> >
>> >> What happens if you store a single file on a single server without
>> replicas and that server goes, or just the disk on that the file is on goes
>> ? HDFS and any decent distributed file system uses replication to prevent
>> data loss. As a side affect having the same replica of a data piece on
>> separate servers means that more than one task can work on the server in
>> parallel.
>> >
>> > Indeed, replicated data does mean Hadoop could work on the same block on
>> separate nodes.  But outside of Hadoop compute jobs, I don't think this is
>> useful in general.  And in any case, a distributed filesystem would let you
>> work on the same block of data from however many nodes you wanted.
>> >
>> >
>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: HDFS without Hadoop: Why?

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Nathan,

we are using HDFS-RAID for our 30 PB cluster. Most datasets have a
replication factor of 2.2 and a few datasets have a replication factor of
1.4.  Some details here:

http://wiki.apache.org/hadoop/HDFS-RAID
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
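
(To make the fractional factors concrete with a rough, illustrative
calculation -- not necessarily our exact stripe widths: XOR parity over a
stripe of 10 blocks, with both data and parity kept at 2 copies, costs about
(10*2 + 1*2)/10 = 2.2x, while Reed-Solomon with 4 parity blocks per 10 data
blocks at a single copy costs (10 + 4)/10 = 1.4x.)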

thanks,
dhruba


On Tue, Jan 25, 2011 at 7:58 PM, <st...@yahoo.com> wrote:

> My point was it's not RAID or whatr versus HDFS. HDFS is a distributed file
> system that solves different problems.
>
>  HDFS is a file system. It's like asking NTFS or RAID?
>
> >but can be generally dealt with using hardware and software failover
> techniques.
>
> Like hdfs.
>
> Best,
>  -stu
> -----Original Message-----
> From: Nathan Rutman <nr...@gmail.com>
> Date: Tue, 25 Jan 2011 17:31:25
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
>
>
> On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:
>
> > I don't think, as a recovery strategy, RAID scales to large amounts of
> data. Even as some kind of attached storage device (e.g. Vtrack), you're
> only talking about a few terabytes of data, and it doesn't tolerate node
> failure.
>
> When talking about large amounts of data, 3x redundancy absolutely doesn't
> scale.  Nobody is going to pay for 3 petabytes worth of disk if they only
> need 1 PB worth of data.  This is where dedicated high-end raid systems come
> in (this is in fact what my company, Xyratex, builds).  Redundant
> controllers, battery backup, etc.  The incremental cost for an additional
> drive in such systems is negligible.
>
> >
> > A key part of hdfs is the distributed part.
>
> Granted, single-point-of-failure arguments are valid when concentrating all
> the storage together, but can be generally dealt with using hardware and
> software failover techniques.
>
> The scale argument in my mind is exactly reversed -- HDFS works fine for
> smaller installations that can't afford RAID hardware overhead and access
> redundancy, and where buying 30 drives instead of 10 is an acceptable cost
> for the simplicity of HDFS setup.
>
> >
> > Best,
> > -stu
> > -----Original Message-----
> > From: Nathan Rutman <nr...@gmail.com>
> > Date: Tue, 25 Jan 2011 16:32:07
> > To: <hd...@hadoop.apache.org>
> > Reply-To: hdfs-user@hadoop.apache.org
> > Subject: Re: HDFS without Hadoop: Why?
> >
> >
> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> >
> >> Hi,
> >>
> >> Why would 3x data seem wasteful?
> >> This is exactly what you want.  I would never store any serious business
> data without some form of replication.
> >
> > I agree that you want data backup, but 3x replication is the least
> efficient / most expensive (space-wise) way to do it.  This is what RAID was
> invented for: RAID 6 gives you fault tolerance against loss of any two
> drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note
> this in my original email, but that's what I had in mind.) RAID is also not
> necessarily $ expensive either; Linux MD RAID is free and effective.
> >
> >> What happens if you store a single file on a single server without
> replicas and that server goes, or just the disk on that the file is on goes
> ? HDFS and any decent distributed file system uses replication to prevent
> data loss. As a side affect having the same replica of a data piece on
> separate servers means that more than one task can work on the server in
> parallel.
> >
> > Indeed, replicated data does mean Hadoop could work on the same block on
> separate nodes.  But outside of Hadoop compute jobs, I don't think this is
> useful in general.  And in any case, a distributed filesystem would let you
> work on the same block of data from however many nodes you wanted.
> >
> >
>
>


-- 
Connect to me at http://www.facebook.com/dhruba

Re: HDFS without Hadoop: Why?

Posted by st...@yahoo.com.
My point was it's not RAID or whatever versus HDFS. HDFS is a distributed file system that solves different problems.

  HDFS is a file system. It's like asking NTFS or RAID? 

>but can be generally dealt with using hardware and software failover techniques.   

Like hdfs.

Best,
 -stu
-----Original Message-----
From: Nathan Rutman <nr...@gmail.com>
Date: Tue, 25 Jan 2011 17:31:25 
To: <hd...@hadoop.apache.org>
Reply-To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?


On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:

> I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.

When talking about large amounts of data, 3x redundancy absolutely doesn't scale.  Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data.  This is where dedicated high-end raid systems come in (this is in fact what my company, Xyratex, builds).  Redundant controllers, battery backup, etc.  The incremental cost for an additional drive in such systems is negligible.  

> 
> A key part of hdfs is the distributed part.

Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.   

The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.

> 
> Best,
> -stu
> -----Original Message-----
> From: Nathan Rutman <nr...@gmail.com>
> Date: Tue, 25 Jan 2011 16:32:07 
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> 
> 
> On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> Why would 3x data seem wasteful? 
>> This is exactly what you want.  I would never store any serious business data without some form of replication.
> 
> I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.
> 
>> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.
> 
> Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
> 
> 


Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
On Jan 25, 2011, at 5:08 PM, stu24mail@yahoo.com wrote:

> I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.

When talking about large amounts of data, 3x redundancy absolutely doesn't scale.  Nobody is going to pay for 3 PB worth of disk if they only need 1 PB of data.  This is where dedicated high-end RAID systems come in (this is in fact what my company, Xyratex, builds).  Redundant controllers, battery backup, etc.  The incremental cost for an additional drive in such systems is negligible.

> 
> A key part of hdfs is the distributed part.

Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.   

The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.

> 
> Best,
> -stu
> -----Original Message-----
> From: Nathan Rutman <nr...@gmail.com>
> Date: Tue, 25 Jan 2011 16:32:07 
> To: <hd...@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> 
> 
> On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> Why would 3x data seem wasteful? 
>> This is exactly what you want.  I would never store any serious business data without some form of replication.
> 
> I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.
> 
>> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.
> 
> Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
> 
> 


Re: HDFS without Hadoop: Why?

Posted by st...@yahoo.com.
I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.

A key part of hdfs is the distributed part.

Best,
 -stu
-----Original Message-----
From: Nathan Rutman <nr...@gmail.com>
Date: Tue, 25 Jan 2011 16:32:07 
To: <hd...@hadoop.apache.org>
Reply-To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?


On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> Why would 3x data seem wasteful? 
> This is exactly what you want.  I would never store any serious business data without some form of replication.

I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily $ expensive either; Linux MD RAID is free and effective.

> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.

Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.



Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> Why would 3x data seem wasteful? 
> This is exactly what you want.  I would never store any serious business data without some form of replication.

I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to get it.  This is what RAID was invented for: RAID 6 gives you fault tolerance against the loss of any two drives, for only about 20% disk space overhead in a typical 10+2 configuration.  (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.)  RAID is not necessarily expensive in dollar terms either; Linux MD RAID is free and effective.
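
(To put numbers on it: in a hypothetical 10-data + 2-parity RAID 6 set,
usable capacity is 10/12 of raw, so 1 PB of data needs about 1.2 PB of disk
versus 3 PB with 3x replication. The exact overhead depends on the array
width you choose.)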

> What happens if you store a single file on a single server without replicas and that server goes, or just the disk on that the file is on goes ? HDFS and any decent distributed file system uses replication to prevent data loss. As a side affect having the same replica of a data piece on separate servers means that more than one task can work on the server in parallel.

Indeed, replicated data does mean Hadoop could work on the same block on separate nodes.  But outside of Hadoop compute jobs, I don't think this is useful in general.  And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.



Re: HDFS without Hadoop: Why?

Posted by Gerrit Jansen van Vuuren <ge...@googlemail.com>.
Hi,

Why would 3x data seem wasteful?
This is exactly what you want.  I would never store any serious business
data without some form of replication.
What happens if you store a single file on a single server without replicas
and that server goes, or just the disk that the file is on goes? HDFS, like
any decent distributed file system, uses replication to prevent data loss.
As a side effect, having replicas of the same piece of data on separate
servers means that more than one task can work on it in parallel.
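
It is also worth remembering that 3x is only the default: the replication
factor in HDFS is per-file and can be raised or lowered per path, e.g.
(the path here is just an example):

  hadoop fs -setrep -w 2 /archive/cold-data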

I do agree that balancing data between the disks on the same node is
currently lacking. Filling up a single disk in HDFS and then adding a new
one is not that big a problem: the DataNode will save new blocks on the new
disk rather than just keep filling the already-full disk. For data
durability this is also not a problem, because with 3x replication you will
have 1 replica on the local node, 1 in the same rack, plus another off-rack.

With hadoop fs -ls <file> you get a listing of files. Yes, it's lacking, but
I've found that if you pipe it through awk you can get the columns you want.
It's a bit tedious, I know.
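
Something along these lines works for me (the column positions are from a
0.20-era listing and may differ between versions; /data is a placeholder):

  # print size and path only, skipping the "Found N items" header line
  hadoop fs -ls /data | awk 'NR > 1 {print $5, $8}'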

Cheers,
 Gerrit


On Tue, Jan 25, 2011 at 10:05 PM, Scott Golby <sg...@conductor.com> wrote:

> Hi Nathan,
>
>
>
> > I have a very general question on the usefulness of HDFS for purposes
> other than running distributed compute jobs for Hadoop.  Hadoop and HDFS
> seem very popular these days, but the use of
>
> > HDFS for other purposes (database backend, records archiving, etc)
> confuses me, since there are other free distributed filesystems out there (I
> personally work on Lustre), with significantly better general-purpose
> performance.
>
>
>
> Coming it from a SysAdmin point of view I could say these things about HDFS
>
>
>
> -          It’s very easy to setup, basically 3 steps Build Box, Format
> NameNode, Start.  It’s easy to be up and running in Production mode HDFS in
> <4 hours with a 5+ machine cluster. (how big that “+” is dependent on your
> node deployment method, Kickstart, VMware, EC2 images, more than the
> complexity of HDFS setup)
>
> -          At least in Hadoop 0.20 it lacks a lot of “Ops” features,
> getting a simple “ls” through an API/hadoop as you mention.
>
> -          The 3x Data seems very wasteful to me
>
> -          Multidisk deployments are difficult for monitoring to deal
> with.  Adding a disk is easy, one xml file and a “,/new/disk_path”.   But,
> the 1st disk will go 100% and say there, while the new disk is perhaps 10%
> full (if you added the new disk when the old one was 90% full).   They fill
> at equal rates.   There is no easy way to balance the disks within the
> machine (the balance command works only across machines), so your Ops guys
> will get the “Disk 100% full” page 24x7 until they kill someone. :)
>
>
>
> I haven’t done any benchmarking, but my expectations for speed aren’t high
> with HDFS.
>
>
>
> Scott
>
>
>

RE: HDFS without Hadoop: Why?

Posted by Scott Golby <sg...@conductor.com>.
Hi Nathan,

> I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very popular these days, but the use of
> HDFS for other purposes (database backend, records archiving, etc) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre), with significantly better general-purpose performance.

Coming at it from a SysAdmin point of view, I could say these things about HDFS:


-          It's very easy to set up, basically 3 steps: build box, format NameNode, start.  It's easy to be up and running with production-mode HDFS in <4 hours with a 5+ machine cluster (how big that "+" is depends more on your node deployment method -- Kickstart, VMware, EC2 images -- than on the complexity of the HDFS setup).

-          At least in Hadoop 0.20 it lacks a lot of "Ops" features; even getting a simple "ls" requires going through the API or the hadoop command, as you mention.

-          The 3x Data seems very wasteful to me

-          Multidisk deployments are difficult for monitoring to deal with.  Adding a disk is easy: one xml file and a ",/new/disk_path" (see the sketch below).   But the disks fill at equal rates, so the 1st disk will go to 100% and stay there while the new disk is perhaps 10% full (if you added the new disk when the old one was 90% full).   There is no easy way to balance the disks within a machine (the balancer works only across machines), so your Ops guys will get the "Disk 100% full" page 24x7 until they kill someone. :)
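
For reference, the "one xml file" is hdfs-site.xml and the property is
dfs.data.dir (the 0.20-era name) -- a minimal sketch with made-up mount
points:

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
  </property>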

I haven't done any benchmarking, but my expectations for speed aren't high with HDFS.

Scott


Re: HDFS without Hadoop: Why?

Posted by Friso van Vollenhoven <fv...@xebia.com>.
HBase is a database that runs on top of HDFS. So that's another one. It has an append-only usage pattern, which makes it a good fit.

I don't see how not-so-commodity hardware could go without replication to achieve the same as HDFS. It's not only about data safety, but also about availability. HDFS can survive complete machines dying, RAM banks going bad, motherboards going haywire and network partitions going offline because of switch failures. Anything that needs to survive a single-box or single-rack failure needs replication, I guess. That said, I think when you have two boxes doing 2PC for writes and each box is itself set up with redundant storage (RAID or otherwise), you get a faster filesystem that is fully redundant. It would not make a nice fit for MapReduce though.


Friso




On 25 jan 2011, at 21:37, Nathan Rutman wrote:

I have a very general question on the usefulness of HDFS for purposes other than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very popular these days, but the use of HDFS for other purposes (database backend, records archiving, etc) confuses me, since there are other free distributed filesystems out there (I personally work on Lustre), with significantly better general-purpose performance.

So please tell me if I'm wrong about any of this.  Note I've gathered most of my info from documentation rather than reading the source code.

As I understand it, HDFS was written specifically for Hadoop compute jobs, with the following design factors in mind:

  *   write-once-read-many (worm) access model
  *   use commodity hardware with relatively high failures rates (i.e. assumptive failures)
  *   long, sequential streaming data access
  *   large files
  *   hardware/OS agnostic
  *   moving computation is cheaper than moving data

While appropriate for processing many large-input Hadoop data-processing jobs, there are significant penalties to be paid when trying to use these design factors for more general-purpose storage:

  *   Commodity hardware requires data replication for safety.  The HDFS implementation has three penalties: storage redundancy, network loading, and blocking writes.  By default, HDFS blocks are replicated 3x: local, "nearby", and "far away" to minimize the impact of data center catastrophe.  In addition to the obvious 3x cost for storage, the result is that every data block must be written "far away" - exactly the opposite of the "Move Computation to Data" mantra.  Furthermore, these over-network writes are synchronous; the client write blocks until all copies are complete on disk, with the longest latency path of 2 network hops plus a disk write gating the overall write speed.   Note that while this would be disastrous for a general-purpose filesystem, with true WORM usage it may be acceptable to penalize writes this way.
  *   Large block size implies fewer files.  HDFS reaches limits in the tens of millions of files.
  *   Large block size wastes space for small file.  The minimum file size is 1 block.
  *   There is no data caching.  When delivering large contiguous streaming data, this doesn't matter.  But when the read load is random, seeky, or partial, this is a missing high-impact performance feature.
  *   In a WORM model, changing a small part of a file requires all the file data to be copied, so e.g. database record modifications would be very expensive.
  *   There are no hardlinks, softlinks, or quotas.
  *   HDFS isn't directly mountable, and therefore requires a non-standard API to use.  (FUSE workaround exists.)
  *   Java source code is very portable and easy to install, but not very quick.
  *   Moving computation is cheaper than moving data.  But the data nonetheless always has to be moved: either read off of a local hard drive or read over the network into the compute node's memory.  It is not necessarily the case that reading a local hard drive is faster than reading a distributed (striped) file over a fast network.  Commodity network (e.g. 1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR Infiniband) can deliver data significantly faster than a local commodity hard drive.

If I'm missing other points, pro- or con-, I would appreciate hearing them.  Note again I'm not questioning the success of HDFS in achieving those stated design choices, but rather trying to understand HDFS's applicability to other storage domains beyond Hadoop.

Thanks for your time.



Re: HDFS without Hadoop: Why?

Posted by Nathan Rutman <nr...@gmail.com>.
On Mon, Jan 31, 2011 at 6:34 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> I feel this is a great discussion, so let's think of HDFS' customers.
> (1) MapReduce --- definitely a perfect fit as Nathan has pointed out
I would add the caveat that this depends on your particular weighting
factors of performance, ease of setup, hardware type, sysadmin
sophistication, failure scenarios, and total cost of ownership.  And
that cost is a non-linear function of scale. It's not true that HDFS
is always the best choice even for MapReduce.

> (2) HBase --- it seems HBase (Bigtable's log structured file) did a great
> job on this. The solution comes out of Google, it must be right.
I think this attitude is a major factor in why people choose HBase
(and HDFS).  But Google also sits at a particular point on the
many-dimensional factor space I alluded to above.  Best for Google
does not mean best for everyone.

> But would
> Google necessarily has chosen this approach in its Bigtable system should
> GFS did not exist in the first place? i.e, can we have alternative 'best'
> approach?
I bet you can guess my answer :)

> Anything else? I do not think HDFS is a good file system choice for
> enterprise applications.
Certainly not for most.

> On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:
>>
>> I have a very general question on the usefulness of HDFS for purposes
>> other than running distributed compute jobs for Hadoop.  Hadoop and HDFS
>> seem very popular these days, but the use of HDFS for other purposes
>> (database backend, records archiving, etc) confuses me, since there are
>> other free distributed filesystems out there (I personally work on Lustre),
>> with significantly better general-purpose performance.
>> So please tell me if I'm wrong about any of this.  Note I've gathered most
>> of my info from documentation rather than reading the source code.
>> As I understand it, HDFS was written specifically for Hadoop compute jobs,
>> with the following design factors in mind:
>>
>> write-once-read-many (worm) access model
>> use commodity hardware with relatively high failures rates (i.e.
>> assumptive failures)
>> long, sequential streaming data access
>> large files
>> hardware/OS agnostic
>> moving computation is cheaper than moving data
>>
>> While appropriate for processing many large-input Hadoop data-processing
>> jobs, there are significant penalties to be paid when trying to use these
>> design factors for more general-purpose storage:
>>
>> Commodity hardware requires data replication for safety.  The HDFS
>> implementation has three penalties: storage redundancy, network loading, and
>> blocking writes.  By default, HDFS blocks are replicated 3x: local,
>> "nearby", and "far away" to minimize the impact of data center catastrophe.
>>  In addition to the obvious 3x cost for storage, the result is that every
>> data block must be written "far away" - exactly the opposite of the "Move
>> Computation to Data" mantra.  Furthermore, these over-network writes are
>> synchronous; the client write blocks until all copies are complete on disk,
>> with the longest latency path of 2 network hops plus a disk write gating the
>> overall write speed.   Note that while this would be disastrous for a
>> general-purpose filesystem, with true WORM usage it may be acceptable to
>> penalize writes this way.
>
> Facebook seems to have a more cost effective way to do replication, but I am
> not sure about its MapReduce performance -- at the end of the day, there are
> only two 'proper' map slot machines that can host a 'cheap' mapper
> operation.
>
>>
>> Large block size implies fewer files.  HDFS reaches limits in the tens of
>> millions of files.
>> Large block size wastes space for small file.  The minimum file size is 1
>> block.
>> There is no data caching.  When delivering large contiguous streaming
>> data, this doesn't matter.  But when the read load is random, seeky, or
>> partial, this is a missing high-impact performance feature.
>
> Yes, can anyone answer this question? -- I want to ask the same question as
> well.

I talked to one of the principal HDFS designers, and he agreed with me
on all these points...

>>
>> In a WORM model, changing a small part of a file requires all the file
>> data to be copied, so e.g. database record modifications would be very
>> expensive.
>
> Yes, can anyone answer this question?
>>
>> There are no hardlinks, softlinks, or quotas.
... except that one.  HDFS now supports symlinks and quotas (hardlinks are still missing).
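As a purely illustrative sketch of the symlink side of that (assuming a
release that ships the FileContext API with symlink support; the paths below
are invented for the example), something like this Java would create a link:

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

// Illustrative only: create an HDFS symlink through the FileContext API.
// Both paths are assumptions made up for this sketch.
public class SymlinkExample {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext();
    fc.createSymlink(new Path("/data/2011/01/logs"),  // existing target
                     new Path("/data/current"),       // symlink to create
                     false);                          // don't create parent dirs
  }
}

Quotas are set on the admin side rather than through client code, with
something along the lines of "hadoop dfsadmin -setQuota" for the name-count
limit and "-setSpaceQuota" for raw bytes.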

>> HDFS isn't directly mountable, and therefore requires a non-standard API
>> to use.  (FUSE workaround exists.)
>> Java source code is very portable and easy to install, but not very quick.
>> Moving computation is cheaper than moving data.  But the data nonetheless
>> always has to be moved: either read off of a local hard drive or read over
>> the network into the compute node's memory.  It is not necessarily the case
>> that reading a local hard drive is faster than reading a distributed
>> (striped) file over a fast network.  Commodity network (e.g. 1GigE),
>> probably yes.  But a fast (and expensive) network (e.g. 4xDDR Infiniband)
>> can deliver data significantly faster than a local commodity hard drive.
>
> I agree with this statement: "It is not necessarily the case that reading a
> local hard drive is faster than reading a distributed (striped) file over a
> fast network."  That is probably true with Infiniband, and perhaps with
> 10GigE as well. And this is why I feel it might not be a good strategy for
> HBase to tie its design entirely to HDFS.

I've proved this to my own satisfaction with a simple TestDFSIO
benchmark on HDFS and Lustre.  I posted the results in another thread
here.
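For anyone who wants to reproduce that kind of comparison: TestDFSIO ships in
the Hadoop test jar, and a typical invocation (file count and size here are
only illustrative) looks something like "hadoop jar hadoop-*test*.jar
TestDFSIO -write -nrFiles 10 -fileSize 1000", followed by the same command
with -read; it appends a throughput summary to a local results file.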

>
>>
>> If I'm missing other points, pro- or con-, I would appreciate hearing
>> them.  Note again I'm not questioning the success of HDFS in achieving those
>> stated design choices, but rather trying to understand HDFS's applicability
>> to other storage domains beyond Hadoop.
>> Thanks for your time.

Re: HDFS without Hadoop: Why?

Posted by Ted Dunning <td...@maprtech.com>.
Speak for yourself!

With 12 local disks on a dual backplane, it is possible to sustain >1GB/s
read or write.  This is with fairly straightforward hardware.

That is pretty difficult to do with commodity networking, which is generally
limited to dual 1Gb NICs for an aggregate of about 200MB/s if you are lucky.

Moreover, bandwidth to locally cached data is vastly better than bandwidth
to remotely cached data.

I agree that network latency isn't a big factor next to disk latency, but
local bandwidth still wins.
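A rough back-of-envelope check, assuming something like 85MB/s of sustained
sequential throughput per commodity SATA drive (an assumption, not a measured
number): 12 disks x 85MB/s is roughly 1GB/s of aggregate local bandwidth,
while dual bonded 1Gb NICs give 2 x 125MB/s = 250MB/s theoretical, or about
the 200MB/s quoted above after overhead.  Under those assumptions, local
reads win by a factor of four to five.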

On Tue, Feb 1, 2011 at 12:26 AM, tsuna <ts...@gmail.com> wrote:

> If you have a well-designed network infrastructure, using local disks
> is not noticeably faster than using a remote disk.  The performance
> penalty of network hops is negligible compared to the performance
> penalty of going to disk.
>

Re: HDFS without Hadoop: Why?

Posted by tsuna <ts...@gmail.com>.
On Mon, Jan 31, 2011 at 6:34 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:
>>    - Commodity hardware requires data replication for safety.  The HDFS
>>    implementation has three penalties: storage redundancy, network loading, and
>>    blocking writes.  By default, HDFS blocks are replicated 3x: local,
>>    "nearby", and "far away" to minimize the impact of data center catastrophe.
>>     In addition to the obvious 3x cost for storage, the result is that every
>>    data block must be written "far away" - exactly the opposite of the "Move
>>    Computation to Data" mantra.  Furthermore, these over-network writes are
>>    synchronous; the client write blocks until all copies are complete on disk,
>>    with the longest latency path of 2 network hops plus a disk write gating the
>>    overall write speed.   Note that while this would be disastrous for a
>>    general-purpose filesystem, with true WORM usage it may be acceptable to
>>    penalize writes this way.

If you have a well-designed network infrastructure, using local disks
is not noticeably faster than using a remote disk.  The performance
penalty of network hops is negligible compared to the performance
penalty of going to disk.

>>    - There is no data caching.  When delivering large contiguous streaming
>>    data, this doesn't matter.  But when the read load is random, seeky, or
>>    partial, this is a missing high-impact performance feature.

The caching is provided by the OS.  The Linux page cache does a much
better job of caching and pre-fetching than a Hadoop DataNode could ever
dream of.
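To make the page-cache point concrete, here is a minimal, hypothetical Java
sketch (the file path is an assumption; point it at any large local file,
e.g. a block file under a DataNode data directory).  The second pass over
the same file is typically served from the Linux page cache and finishes far
faster than the first:

import java.io.FileInputStream;
import java.io.IOException;

// Illustrative only: time two sequential reads of the same local file.
// The warm read is normally served from the OS page cache, not the disk.
public class PageCacheDemo {
  public static void main(String[] args) throws IOException {
    String path = args.length > 0 ? args[0] : "/tmp/largefile";  // assumed path
    System.out.println("cold read: " + timeRead(path) + " ms");
    System.out.println("warm read: " + timeRead(path) + " ms");
  }

  private static long timeRead(String path) throws IOException {
    byte[] buf = new byte[64 * 1024];
    long start = System.currentTimeMillis();
    FileInputStream in = new FileInputStream(path);
    try {
      while (in.read(buf) != -1) { /* discard the data; we only want timing */ }
    } finally {
      in.close();
    }
    return System.currentTimeMillis() - start;
  }
}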

>>    - In a WORM model, changing a small part of a file requires all the
>>    file data to be copied, so e.g. database record modifications would be very
>>    expensive.
>>
> Yes, can anyone answer this question?

HBase doesn't modify files.  It only appends edits to its write-ahead log
and periodically writes out whole new files at once.
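A minimal sketch of that write-once pattern against the plain FileSystem API
(this is not HBase's actual code; the path and records are invented for the
example): buffered edits are flushed into a brand-new immutable file, and
nothing already stored on HDFS is ever rewritten in place.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: flush a batch of in-memory edits as one new file.
public class WriteOnceFlush {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path flushed = new Path("/data/flushes/flush-" + System.currentTimeMillis());
    FSDataOutputStream out = fs.create(flushed, false);  // never overwrite
    try {
      out.writeBytes("row1,columnA,value1\n");  // edits written exactly once
      out.writeBytes("row2,columnB,value2\n");
    } finally {
      out.close();  // from here on the file is only ever read, never modified
    }
  }
}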

>>    - Moving computation is cheaper than moving data.  But the data
>>    nonetheless always has to be moved: either read off of a local hard drive or
>>    read over the network into the compute node's memory.  It is not necessarily
>>    the case that reading a local hard drive is faster than reading a
>>    distributed (striped) file over a fast network.  Commodity network (e.g.
>>    1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR
>>    Infiniband) can deliver data significantly faster than a local commodity
>>    hard drive.
>>
> I agree with this statement: "It is not necessarily the case that reading a
> local hard drive is faster than reading a distributed (striped) file over a
> fast network."  That is probably true with Infiniband, and perhaps with
> 10GigE as well. And this is why I feel it might not be a good strategy for
> HBase to tie its design entirely to HDFS.

You don't need Infiniband or 10GigE.  At StumbleUpon we have line-rate
1GbE between every pair of machines, and the RTT latency across racks is
about 95µs.  It's far, far more cost-effective than 10GbE or Infiniband.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: HDFS without Hadoop: Why?

Posted by Sean Bigdatafun <se...@gmail.com>.
I feel this is a great discussion, so let's think about HDFS's customers.

(1) MapReduce --- definitely a perfect fit, as Nathan has pointed out
*(2) HBase --- it seems HBase (with Bigtable's log-structured file design) did
a great job on this. The solution comes out of Google, so it must be right.
But would Google necessarily have chosen this approach in its Bigtable system
if GFS had not existed in the first place? i.e., could there be an alternative
'best' approach?*


Anything else? I do not think HDFS is a good file system choice for
enterprise applications.

On Tue, Jan 25, 2011 at 12:37 PM, Nathan Rutman <nr...@gmail.com> wrote:

> I have a very general question on the usefulness of HDFS for purposes other
> than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very
> popular these days, but the use of HDFS for other purposes (database
> backend, records archiving, etc) confuses me, since there are other free
> distributed filesystems out there (I personally work on Lustre), with
> significantly better general-purpose performance.
>
> So please tell me if I'm wrong about any of this.  Note I've gathered most
> of my info from documentation rather than reading the source code.
>
> As I understand it, HDFS was written specifically for Hadoop compute jobs,
> with the following design factors in mind:
>
>    - write-once-read-many (worm) access model
>    - use commodity hardware with relatively high failures rates (i.e.
>    assumptive failures)
>    - long, sequential streaming data access
>    - large files
>    - hardware/OS agnostic
>    - moving computation is cheaper than moving data
>
>
> While appropriate for processing many large-input Hadoop data-processing
> jobs, there are significant penalties to be paid when trying to use these
> design factors for more general-purpose storage:
>
>    - Commodity hardware requires data replication for safety.  The HDFS
>    implementation has three penalties: storage redundancy, network loading, and
>    blocking writes.  By default, HDFS blocks are replicated 3x: local,
>    "nearby", and "far away" to minimize the impact of data center catastrophe.
>     In addition to the obvious 3x cost for storage, the result is that every
>    data block must be written "far away" - exactly the opposite of the "Move
>    Computation to Data" mantra.  Furthermore, these over-network writes are
>    synchronous; the client write blocks until all copies are complete on disk,
>    with the longest latency path of 2 network hops plus a disk write gating the
>    overall write speed.   Note that while this would be disastrous for a
>    general-purpose filesystem, with true WORM usage it may be acceptable to
>    penalize writes this way.
>
> Facebook seems to have a more cost-effective way to do replication, but I
am not sure about its MapReduce performance -- at the end of the day, there
are only two 'proper' map slot machines that can host a 'cheap' mapper
operation.


>
>    - Large block size implies fewer files.  HDFS reaches limits in the
>    tens of millions of files.
>    - Large block size wastes space for small file.  The minimum file size
>    is 1 block.
>    - There is no data caching.  When delivering large contiguous streaming
>    data, this doesn't matter.  But when the read load is random, seeky, or
>    partial, this is a missing high-impact performance feature.
>
> Yes, can anyone answer this question? -- I want to ask the same question as
well.

>
>    - In a WORM model, changing a small part of a file requires all the
>    file data to be copied, so e.g. database record modifications would be very
>    expensive.
>
> Yes, can anyone answer this question?

>
>    - There are no hardlinks, softlinks, or quotas.
>    - HDFS isn't directly mountable, and therefore requires a non-standard
>    API to use.  (FUSE workaround exists.)
>    - Java source code is very portable and easy to install, but not very
>    quick.
>    - Moving computation is cheaper than moving data.  But the data
>    nonetheless always has to be moved: either read off of a local hard drive or
>    read over the network into the compute node's memory.  It is not necessarily
>    the case that reading a local hard drive is faster than reading a
>    distributed (striped) file over a fast network.  Commodity network (e.g.
>    1GigE), probably yes.  But a fast (and expensive) network (e.g. 4xDDR
>    Infiniband) can deliver data significantly faster than a local commodity
>    hard drive.
>
> I agree with this statement: "It is not necessarily the case that reading a
local hard drive is faster than reading a distributed (striped) file over a
fast network."  That is probably true with Infiniband, and perhaps with
10GigE as well. And this is why I feel it might not be a good strategy for
HBase to tie its design entirely to HDFS.



> If I'm missing other points, pro- or con-, I would appreciate hearing them.
>  Note again I'm not questioning the success of HDFS in achieving those
> stated design choices, but rather trying to understand HDFS's applicability
> to other storage domains beyond Hadoop.
>
> Thanks for your time.
>
>


-- 
--Sean
