Posted to hdfs-user@hadoop.apache.org by Per Steffensen <st...@designware.dk> on 2011/09/14 20:02:14 UTC

HDFS vs software RAID like md(adm)

Hi

If my goal is to have multiple physical disks appear as one big disk with 
redundancy built in, why would I use an HDFS cluster across machines with 
one disk each, instead of using software RAID like md(adm) directly on 
top of the disks? I am looking for pros and cons of the two solutions.
http://en.wikipedia.org/wiki/RAID#Software-based_RAID
http://en.wikipedia.org/wiki/Mdadm

Regards, Per Steffensen

Re: HDFS vs software RAID like md(adm)

Posted by Kanghua151 <ka...@msn.com>.
Hi Masters,
I want to develop a log-structured filesystem based on HDFS. This filesystem would be used to host virtual machine image files.
On HDFS I can implement snapshots and data redundancy; as a log-structured filesystem, it would support random access.
I also hope to do part of the segment cleaning with MapReduce.
I am not sure whether this is reasonable. I really hope to build an environment that supports online and offline applications at the same time, and I am trying to do it on HDFS. Can you give me some advice?
Thanks,
kanghua
Sent from my iPhone

On 2011-9-15, at 14:54, Norman Maurer <no...@googlemail.com> wrote:

> You should keep in mind that HDFS is not POSIX-compliant, so you will
> have a hard time using it as a "real fs". I know there is a FUSE driver
> for it, but I would not use it for heavy usage. Also, HDFS is not really
> a good fit for random access at all.
> 
> If you really need a POSIX fs, I would recommend you have a look at
> DRBD or GlusterFS.
> 
> Bye,
> Norman
> 

Re: HDFS vs software RAID like md(adm)

Posted by Per Steffensen <st...@designware.dk>.
Norman Maurer wrote:
> You should keep in mind that HDFS is not POSIX-compliant, so you will
> have a hard time using it as a "real fs". I know there is a FUSE driver
>   
I guess there are a few solutions: http://wiki.apache.org/hadoop/MountableHDFS
An alternative would be to write the file-accessing code directly 
against the HDFS filesystem, or perhaps against a VFS 
(http://en.wikipedia.org/wiki/Virtual_file_system) other than the one that 
mounting gives us through FUSE 
(http://en.wikipedia.org/wiki/Filesystem_in_Userspace). That would of course 
have to be a VFS that has a port to HDFS, e.g. this port 
(https://issues.apache.org/jira/browse/HDFS-1213) to Apache 
Commons VFS (http://commons.apache.org/vfs/).
> for it but I would not use it for heavy usage.
OK, thanks. It will see heavy usage, so that is a good con.
>  Also HDFS is not really
> a good fit for random access at all.
>   
Also a good con.
> If you really need a POSIX fs, I would recommend you have a look at
> DRBD or GlusterFS.
>   
Thanks. I will have a look at those.
> Bye,
> Norman


Re: HDFS vs software RAID like md(adm)

Posted by Norman Maurer <no...@googlemail.com>.
You should keep in mind that HDFS is not POSIX-compliant, so you will
have a hard time using it as a "real fs". I know there is a FUSE driver
for it, but I would not use it for heavy usage. Also, HDFS is not really
a good fit for random access at all.

If you really need a POSIX fs, I would recommend you have a look at
DRBD or GlusterFS.

Bye,
Norman



Re: HDFS vs software RAID like md(adm)

Posted by Per Steffensen <st...@designware.dk>.
David Rosenstrauch wrote:
> On 09/14/2011 02:02 PM, Per Steffensen wrote:
>> Hi
>>
>> If my goal is to have multiple physical disks seem as one big disk with
>> redundancy built in, why would I use a HDFS cluster among machines with
>> one disk each, instead of using software RAID like md(adm) directly on
>> top of the disks? I am looking for pros and cons on the two solutions.
>> http://en.wikipedia.org/wiki/RAID#Software-based_RAID
>> http://en.wikipedia.org/wiki/Mdadm
>>
>> Regards, Per Steffensen
>
> HDFS was never intended to be a general-purpose file system.  It is a 
> system optimized for a) running map/reduce, and b) holding large 
> files.  It should not be considered as a replacement for RAID.
>
> DR
Thanks for your reply, David. Even though HDFS wasn't intended to be used 
for this, I guess it could be. So if we forget for a moment that it was 
not designed/optimized to be used as a general-purpose file system 
(GPFS), what are the pros and cons of using it as a GPFS with built-in 
redundancy vs using software RAID? Is HDFS too slow for some kinds of 
file operations, or what will the problems (and benefits) be? I hope for 
some input; I need arguments for and against to use in a discussion 
with a customer. Thanks!


Re: HDFS vs software RAID like md(adm)

Posted by David Rosenstrauch <da...@darose.net>.
On 09/14/2011 02:02 PM, Per Steffensen wrote:
> Hi
>
> If my goal is to have multiple physical disks seem as one big disk with
> redundancy built in, why would I use a HDFS cluster among machines with
> one disk each, instead of using software RAID like md(adm) directly on
> top of the disks? I am looking for pros and cons on the two solutions.
> http://en.wikipedia.org/wiki/RAID#Software-based_RAID
> http://en.wikipedia.org/wiki/Mdadm
>
> Regards, Per Steffensen

HDFS was never intended to be a general-purpose file system. It is a 
system optimized for a) running map/reduce, and b) holding large files. 
It should not be considered a replacement for RAID.

DR
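
One concrete point for the pros-and-cons list is how much of the raw disk space each approach leaves usable. The sketch below is only back-of-the-envelope arithmetic: it assumes identical disks, two-way mirroring for RAID 1, and the HDFS default replication factor of 3, and it ignores filesystem overhead and spare disks. The function name `usable_tb` and the scheme labels are made up for illustration.

```python
# Usable capacity for a set of identical disks under different redundancy schemes.
# Simplified on purpose: no filesystem overhead, no hot spares, no temporary space.

def usable_tb(scheme: str, disks: int, disk_tb: float, replication: int = 3) -> float:
    raw = disks * disk_tb
    if scheme == "raid1":   # assumption: two-way mirroring (RAID 1 pairs or RAID 10)
        return raw / 2
    if scheme == "raid5":   # one disk's worth of parity
        return (disks - 1) * disk_tb
    if scheme == "raid6":   # two disks' worth of parity
        return (disks - 2) * disk_tb
    if scheme == "hdfs":    # every block is stored `replication` times
        return raw / replication
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("raid1", "raid5", "raid6", "hdfs"):
        tb = usable_tb(scheme, disks=6, disk_tb=2.0)
        print(f"{scheme}: {tb:.1f} TB usable of 12.0 TB raw")
```

The numbers do not capture the other side of the trade-off: software RAID only protects against disk failures inside one machine, while HDFS replicas live on different machines, so the data can survive the loss of a whole node.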

Re: HDFS vs software RAID like md(adm)

Posted by Віталій Тимчишин <ti...@gmail.com>.
The main con for me is that all the metadata is kept in the RAM of a single node, so
if you have a lot of files, you need a lot of RAM on the main (name) node. This
limits scalability.
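
To get a feel for the size of that limit, here is a rough estimate of name node memory. It assumes the often-quoted rule of thumb of about 150 bytes of heap per namespace object (file, directory, or block); that figure is an approximation, not something measured on this cluster, and the function name is hypothetical.

```python
# Rough name node heap estimate. BYTES_PER_OBJECT is an assumed rule of thumb,
# not a measured value; real usage depends on the Hadoop version and path lengths.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(files: int, blocks_per_file: float = 1.0, directories: int = 0) -> float:
    """Estimate heap needed to hold metadata for the given namespace, in GiB."""
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT / 1024**3


if __name__ == "__main__":
    # 100 million small files, one block each
    print(f"{namenode_heap_gb(100_000_000):.1f} GiB")
```

Under these assumptions, the cost is per file rather than per byte stored, which is why many small files hurt far more than a few large ones.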

Another thing is that it does not like a lot of directories. Every now and
then it starts checking all the directories, locking the data node in the
meantime. For me this knocked nodes out of the cluster, so I had to write a
patch limiting the check depth.

Also remember that you can only write a file as a whole; there are no updates.
Appends are also not available in stable Apache releases.
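
In practice, "no updates" means that changing even a few bytes requires rewriting the entire file. The sketch below illustrates that pattern using the local filesystem as a stand-in; it is not HDFS client code, the helper name `rewrite_whole_file` is hypothetical, and rename semantics on HDFS may differ from `os.replace`.

```python
import os
import tempfile


def rewrite_whole_file(path: str, transform) -> None:
    """Emulate an 'update' on a write-once store: read all, write a new file, swap names."""
    with open(path, "rb") as src:
        data = src.read()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst:
            dst.write(transform(data))   # the whole file is written again
        os.replace(tmp_path, path)       # swap the new file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)          # do not leave a partial copy behind
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "image.bin")
        with open(p, "wb") as f:
            f.write(b"hello world")
        rewrite_whole_file(p, lambda b: b.replace(b"world", b"hdfs"))
        with open(p, "rb") as f:
            print(f.read())
```

The cost of each change is proportional to the file size, not to the size of the change, which is one reason workloads with frequent small random writes are a poor fit.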
On 14.09.2011 at 21:03, "Per Steffensen" <st...@designware.dk> wrote: