Posted to common-user@hadoop.apache.org by Robert Krüger <kr...@signal7.de> on 2008/03/22 18:32:31 UTC

Hadoop HDFS vs. NFS latency

Hi,

we're currently evaluating the use of Hadoop's HDFS for a project where 
most of the data will be large files and latency will not matter that 
much, so HDFS should suit those cases perfectly. However, some of the 
data will be smaller files (a few KB) that will be accessed quite 
frequently (e.g. by an HTTP server serving user requests, although not 
under very high load). Now, since the docs state that HDFS is optimized 
for throughput rather than latency, I am curious what order of magnitude 
of overhead is to be expected for, e.g., reading a file's metadata (such 
as the modification timestamp) or streaming (reading) a small file as a 
whole, compared to NFS access or access to these resources via another 
HTTP server using the same network hardware and setup. Paying a 10% 
latency penalty compared to NFS would probably be OK, but at the moment 
I have no idea whether we are talking about a 1, 10 or 100% difference.
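
For concreteness, the two operations I have in mind would look roughly 
like this with the Hadoop FileSystem API (the namenode URI and path 
below are just placeholders):

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder: point this at your cluster's namenode
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/some/small/file.html");

        // metadata read (e.g. the modification timestamp)
        FileStatus status = fs.getFileStatus(path);
        System.out.println("modified: " + status.getModificationTime());

        // streaming read of the whole (small) file
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        FSDataInputStream in = fs.open(path);
        IOUtils.copyBytes(in, bytes, 4096, true); // true = close streams
        System.out.println("read " + bytes.size() + " bytes");
    }
}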

Has anyone done something similar, i.e. served not-so-large files 
directly from an HDFS cluster via an HTTP server?
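
What I mean is essentially a servlet along these lines sitting behind 
the HTTP server and streaming files straight out of HDFS (hypothetical 
paths and cluster URI, no caching or error handling):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServlet extends HttpServlet {
    private FileSystem fs;

    public void init() throws ServletException {
        try {
            Configuration conf = new Configuration();
            // placeholder: point this at your cluster's namenode
            conf.set("fs.default.name", "hdfs://namenode:9000");
            fs = FileSystem.get(conf);
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // map the request path onto an HDFS path,
        // e.g. /files/foo.html -> /content/foo.html
        Path path = new Path("/content" + req.getPathInfo());
        FileStatus status = fs.getFileStatus(path);
        resp.setContentLength((int) status.getLen());
        resp.setDateHeader("Last-Modified", status.getModificationTime());
        FSDataInputStream in = fs.open(path);
        IOUtils.copyBytes(in, resp.getOutputStream(), 4096, true);
    }
}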

Thanks in advance,

Robert


Re: Hadoop HDFS vs. NFS latency

Posted by Robert Krüger <kr...@signal7.de>.
Ted Dunning wrote:
> A few million files should fit pretty easily in hdfs.
> 
> One problem is that hadoop is not designed with full high availability in
> mind.  Mogile is easier to adapt to those needs.
Sorry to be so persistent, but which failure scenarios would mogile 
handle better than Hadoop HDFS, or the other way around? In other 
words, which potential failures would one have to live with when using 
HDFS compared to mogile?

> 
> Latency in our case is much less critical than for many applications since
> we are talking about web service and are only the origin point for a content
> delivery network (or three).
That is probably the case in our app as well, i.e. all potentially 
high-volume accesses to resources backed by HDFS will go through a CDN. 
So if we can expect HDFS latency to be of the same order of magnitude 
as getting the files from a local HTTP server (using the same 
hardware), we should be OK. Any idea whether that is the case?
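
If nobody has numbers, a crude way to get an order-of-magnitude answer 
on our own hardware would probably be a micro-benchmark along these 
lines (paths and namenode URI are placeholders):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadLatencyBench {
    // read a stream to the end and return elapsed time in microseconds
    static long timeRead(InputStream in) throws Exception {
        long start = System.nanoTime();
        IOUtils.copyBytes(in, new ByteArrayOutputStream(), 4096, true);
        return (System.nanoTime() - start) / 1000;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder
        FileSystem fs = FileSystem.get(conf);

        for (int i = 0; i < 100; i++) {
            long hdfs = timeRead(fs.open(new Path("/bench/small-file")));
            long local = timeRead(new FileInputStream("/tmp/small-file"));
            System.out.println("hdfs=" + hdfs + "us  local=" + local + "us");
        }
    }
}

(The first iterations will include connection setup, so the later, 
steady-state numbers would be the ones to look at.)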

Thanks again!

Robert

-- 
(-) Robert Krüger
(-) SIGNAL 7 Gesellschaft für Informationstechnologie mbH
(-) Landwehrstraße 4 - 64293 Darmstadt,
(-) Tel: +49 (0) 6151 969 96 11, Fax: +49 (0) 6151 969 96 29
(-) krueger@signal7.de, www.signal7.de
(-) Amtsgericht Darmstadt, HRB 6833
(-) Geschäftsführer: Robert Krüger, Frank Peters, Jochen Strunk

Re: Hadoop HDFS vs. NFS latency

Posted by Ted Dunning <td...@veoh.com>.
A few million files should fit pretty easily in hdfs.

One problem is that hadoop is not designed with full high availability in
mind.  Mogile is easier to adapt to those needs.

Latency in our case is much less critical than for many applications since
we are talking about web service and are only the origin point for a content
delivery network (or three).


On 3/24/08 2:44 AM, "Robert Krüger" <kr...@signal7.de> wrote:

> 
> Thank you very much for sharing your experience! This is very helpful
> and we will take a look at mogile.
> 
> I have two questions regarding your decision against HDFS. You mention
> issues with scale regarding the number of files. Could you elaborate a
> bit? At what order of magnitude would you expect problems with HDFS,
> and what kind of problems? Our system will have at most a few million
> files, so that's at least two orders of magnitude less than in your case.
> 
> Did you look at latency of file access at all? Did you run into any
> issues there?
> 
> Thanks,
> 
> Robert


Re: Hadoop HDFS vs. NFS latency

Posted by Robert Krüger <kr...@signal7.de>.
Thank you very much for sharing your experience! This is very helpful 
and we will take a look at mogile.

I have two questions regarding your decision against HDFS. You mention 
issues with scale regarding the number of files. Could you elaborate a 
bit? At what order of magnitude would you expect problems with HDFS, 
and what kind of problems? Our system will have at most a few million 
files, so that's at least two orders of magnitude less than in your case.

Did you look at latency of file access at all? Did you run into any 
issues there?

Thanks,

Robert


Ted Dunning wrote:
> We at Veoh face a similar problem of serving files from a distributed store,
> probably on a somewhat larger scale than you are looking at.
> 
> We evaluated several systems and determined that, for our needs, hadoop was
> not a good choice.  The advantages of hadoop (map-reduce, blocking) were
> substantially outweighed by issues with scale (we have nearly a billion
> files that we need to work with) and hackability (we need to have system
> admins be able to modify placement policies).  We also were unable to find
> anybody else using hadoop in this way.
> 
> Commercial distributed stores were eliminated based on cost alone.
> 
> In the end, we chose to use mogile for origin file service and have been
> pretty happy with the choice, although we did wind up making significant
> changes to the system in order to facilitate scaling.  One major advantage
> for the choice of mogile was the fact that we knew that it was already being
> used in some high volume sites.


-- 
(-) Robert Krüger
(-) SIGNAL 7 Gesellschaft für Informationstechnologie mbH
(-) Landwehrstraße 4 - 64293 Darmstadt,
(-) Tel: +49 (0) 6151 969 96 11, Fax: +49 (0) 6151 969 96 29
(-) krueger@signal7.de, www.signal7.de
(-) Amtsgericht Darmstadt, HRB 6833
(-) Geschäftsführer: Robert Krüger, Frank Peters, Jochen Strunk

Re: Hadoop HDFS vs. NFS latency

Posted by Ted Dunning <td...@veoh.com>.
We at Veoh face a similar problem of serving files from a distributed store,
probably on a somewhat larger scale than you are looking at.

We evaluated several systems and determined that, for our needs, hadoop was
not a good choice.  The advantages of hadoop (map-reduce, blocking) were
substantially outweighed by issues with scale (we have nearly a billion
files that we need to work with) and hackability (we need to have system
admins be able to modify placement policies).  We also were unable to find
anybody else using hadoop in this way.

Commercial distributed stores were eliminated based on cost alone.

In the end, we chose to use mogile for origin file service and have been
pretty happy with the choice, although we did wind up making significant
changes to the system in order to facilitate scaling.  One major advantage
for the choice of mogile was the fact that we knew that it was already being
used in some high volume sites.

