Posted to common-user@hadoop.apache.org by Monchanin Eric <em...@skycore.com> on 2008/09/12 11:48:52 UTC

HDFS

Hello to all,

I was drawn to the Hadoop project while looking for a solution for my
application.
Basically, I have an application hosting user-generated content (images,
sounds, videos), and I would like to have this content available at all
times to all my servers.
Servers will mostly add new content; users can manipulate the existing
content, make compositions, and so on.

We have a few servers (2 for now) dedicated to hosting content, and
right now they are connected via sshfs on some folders in order to
shorten the transfer time between these content servers and the
application servers.

Would the Hadoop filesystem be useful in my case, and is it worth
digging into?

If it is doable, how redundant is the system? For instance, to store
1 MB of data, how much storage do I need (I guess at least 2 MB ...)?

I hope I made myself clear enough and will get encouraging answers,

Bests to all,

Eric


RE: HDFS

Posted by Dmitry Pushkarev <um...@stanford.edu>.
Why not use HAR over HDFS? The idea is that if you don't do much
writing, compacting your files into HAR archives (which will be stored
in 64 MB slices) might be a good answer.

Hence a question for the Hadoop developers: is Hadoop HAR-aware?
In two senses:
1. Does it try to assign tasks close to the data (i.e., does it know
which piece of a HAR file a given file belongs to)?
2. If a HAR folder is fed to a map task, will Hadoop hand all the files
to the task together through the local datanode, or will fetching each
file mean another round trip to the namenode for resolution?
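
For context, a HAR archive is created with the hadoop archive
command-line tool and then read back through the har:// filesystem
scheme. Below is a minimal read-side sketch in Java; the archive name
and paths are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical archive, created earlier with something like:
            //   hadoop archive -archiveName media.har /user/eric/media /user/eric
            // Files keep their relative paths inside the archive.
            Path inHar = new Path("har:///user/eric/media.har/media/logo.png");

            // The har:// scheme resolves the archive's index and reads the
            // requested bytes out of the underlying part files on HDFS.
            FileSystem fs = inHar.getFileSystem(conf);
            FSDataInputStream in = fs.open(inHar);
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
            in.close();
        }
    }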


Re: HDFS

Posted by Monchanin Eric <em...@skycore.com>.
Hi,

Thank you all for your answers; I am now digging into various
distributed file systems.
So far I am checking out MogileFS, KFS ... which seem to better fit my
needs.
I'll indeed be dealing with thousands of small files (a few kB to a few
MB, 10 MB tops).
Our service is based on a web portal driven by PHP.
I'm having a hard time compiling either of them, though (MogileFS with
Perl and its dependencies, KFS with C++ and its dependencies).

If any of you have experience to share, I'd be glad to listen.

Cheers,

Eric


Re: HDFS

Posted by James Moore <ja...@gmail.com>.
On Fri, Sep 12, 2008 at 3:08 AM, Robert Krüger <kr...@signal7.de> wrote:
> we are currently building a system for a very similar purpose (digital
> asset management), and we currently use HDFS for a volume of approx.
> 100 TB, with the option to scale into the PB range.

Robert, would you mind expanding on why you picked HDFS over something
like GFS or MogileFS?  I would have agreed with Mikhail - HDFS seems
like it's purpose-built for Hadoop, and wouldn't necessarily be the
best choice if you just wanted a filesystem.

-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

Re: HDFS

Posted by Robert Krüger <kr...@signal7.de>.
Hi Eric,

we are currently building a system for a very similar purpose (digital
asset management), and we currently use HDFS for a volume of approx.
100 TB, with the option to scale into the PB range. Since we haven't
gone into production yet, I cannot say it will work flawlessly, but so
far everything has worked very well, with really good performance
(especially read performance, which is probably the most important
factor in your case as well). The most important thing to be aware of,
IMHO, is that you will not have a real file system at the OS level. If
you use tools that need one to process the data, you will need to do
some copying (which we do in some cases). There is a project out there
that makes HDFS available via FUSE, but it appears to be rather alpha,
which is why we haven't dared to take a look at it for this project.

Apart from the namenode, which you have to make redundant yourself
(there are lots of posts in the archives on this topic), you can simply
configure the level of redundancy (see the docs).
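
To illustrate, replication is controlled by the dfs.replication setting
and can also be changed per file through the FileSystem API. A rough
sketch (the file path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // dfs.replication controls how many copies of each block HDFS
            // keeps. The default is 3, so storing 1 MB consumes roughly
            // 3 MB of raw disk across the cluster; with a value of 2 it
            // would be about 2 MB.
            conf.set("dfs.replication", "2");
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be adjusted per file after the fact;
            // this path is hypothetical.
            fs.setReplication(new Path("/user/eric/media/clip.mp4"), (short) 3);

            fs.close();
        }
    }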

Hope this helps,

Robert


Re: HDFS

Posted by Mikhail Yakshin <gr...@gmail.com>.
Hi,

> Would the Hadoop filesystem be useful in my case, and is it worth
> digging into?

I guess not; your best choice would be something like MogileFS. HDFS
is a filesystem optimized for distributed computation, and thus it
works best with big files (comparable to the block size, e.g. 64 MB).
Hosting lots of smaller files would be overkill.
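
That said, the usual workaround if you do want to keep many small files
on HDFS is to pack them into a container format such as a SequenceFile,
one record per file, keyed by filename. A minimal sketch, with invented
paths:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Pack a local directory of small files into one SequenceFile
            // on HDFS: filename -> raw bytes. Paths are hypothetical.
            Path out = new Path("/user/eric/media.seq");
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                File[] files = new File("/srv/uploads").listFiles();
                if (files != null) {
                    for (File f : files) {
                        byte[] bytes = Files.readAllBytes(f.toPath());
                        writer.append(new Text(f.getName()),
                                      new BytesWritable(bytes));
                    }
                }
            } finally {
                writer.close();
            }
        }
    }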

-- 
WBR, Mikhail Yakshin