Posted to common-user@hadoop.apache.org by Dennis Kubes <ku...@apache.org> on 2008/11/26 13:12:05 UTC
Namenode BlocksMap on Disk
From time to time a message pops up on the mailing list about OOM
errors for the namenode because of too many files. Most recently there
was a 1.7 million file installation that was failing. I know the simple
solution to this is to have a larger Java heap for the namenode. But
the non-simple way would be to convert the BlocksMap for the NameNode to
be stored on disk and then queried and updated for operations. This
would eliminate memory problems for large file installations but might
also degrade performance slightly. Questions:
1) Is there any current work to allow the namenode to store on disk
versus in memory? This could be a configurable option.
2) Besides possible slight degradation in performance, is there a reason
why the BlocksMap shouldn't or couldn't be stored on disk?
I am willing to put in the work to make this happen. Just want to
make sure I am not going down the wrong path to begin with.
Dennis
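To make the proposal concrete, here is a minimal, hypothetical sketch (not Hadoop code) of what "stored on disk and then queried and updated" could look like: fixed-size records in a file, directly indexed by block id, so every lookup or update costs one seek plus a small read or write. The class name, record layout, and long-valued entries are illustrative assumptions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: a file-backed map from block id to a long value,
// directly indexed by block id (ids 0..capacity-1). Every get/put is one
// seek plus one small read/write, which is the cost the thread debates.
public class DiskBlockMap implements AutoCloseable {
    private static final int RECORD_BYTES = 8; // one long value per block id
    private final RandomAccessFile file;

    public DiskBlockMap(String path, long capacity) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(capacity * RECORD_BYTES); // pre-size the file
    }

    public void put(long blockId, long value) throws IOException {
        file.seek(blockId * RECORD_BYTES); // the per-operation disk seek
        file.writeLong(value);
    }

    public long get(long blockId) throws IOException {
        file.seek(blockId * RECORD_BYTES);
        return file.readLong();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        String path = java.io.File.createTempFile("blockmap", ".dat").getPath();
        try (DiskBlockMap map = new DiskBlockMap(path, 1024)) {
            map.put(42, 7L);
            System.out.println(map.get(42)); // prints 7
        }
    }
}
```

A real implementation would need collision-free keying for sparse 64-bit block ids and variable-length replica lists; this sketch only illustrates the seek-per-operation access pattern.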
Re: Namenode BlocksMap on Disk
Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Dennis Kubes wrote:
>
> From time to time a message pops up on the mailing list about OOM
> errors for the namenode because of too many files. Most recently there
> was a 1.7 million file installation that was failing. I know the simple
> solution to this is to have a larger java heap for the namenode. But
> the non-simple way would be to convert the BlocksMap for the NameNode to
> be stored on disk and then queried and updated for operations. This
> would eliminate memory problems for large file installations but also
> might degrade performance slightly. Questions:
>
> 1) Is there any current work to allow the namenode to store on disk
> versus in memory? This could be a configurable option.
>
> 2) Besides possible slight degradation in performance, is there a reason
> why the BlocksMap shouldn't or couldn't be stored on disk?
As Doug mentioned, the main worry is that this will drastically reduce
performance. Part of the reason is that a large chunk of the work on the
NameNode happens under a single global lock. So if there is a seek under
this lock, it affects everything else.
One good long-term fix for this is to make it easy to split the
namespace between multiple namenodes. There was some work done on
supporting "volumes". Also, the fact that HDFS now supports symbolic
links might make it easier for someone adventurous to use them as a
quick hack to get around this.
If you have a rough prototype implementation, I am sure there will be a
lot of interest in evaluating it. If Java has any disk-based or
memory-mapped data structures, that might be the quickest way to try
their effects.
Raghu.
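Java does in fact ship a memory-mapped primitive in java.nio, which could be the quick experiment Raghu describes: map the block-map file and let the OS page cache decide what stays in RAM. The file name, size, and offsets below are illustrative assumptions, not a proposed design.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: access a block-map file through a memory mapping. Hot pages
// behave like memory; cold pages fault in from disk, approximating the
// performance trade-off discussed in this thread.
public class MappedBlockMap {
    public static void main(String[] args) throws IOException {
        String path = java.io.File.createTempFile("mapped", ".dat").getPath();
        long capacity = 1024; // illustrative number of 8-byte slots
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity * 8);
            buf.putLong(42 * 8, 99L);                // update slot 42 in place
            System.out.println(buf.getLong(42 * 8)); // prints 99
        }
    }
}
```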
Re: Namenode BlocksMap on Disk
Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
> We can also try to mount the particular dir on ramfs
This could be an interesting project: replacing the whole name-node
storage, that is, its FSDirectory part, with ramfs.
If there are people interested in that kind of experiment, we can discuss it.
>>> 2) Besides possible slight degradation in performance, is there a
>>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
The degradation is expected to be substantial. But another reason is that it
would introduce a whole new layer in the name-node implementation, responsible
for keeping the memory image in sync with the on-disk block map, with caching, etc.
As Raghu mentioned, there was an idea of separating the blockMap into a
standalone server, but currently there is no ongoing progress in this direction.
--Konstantin
Sagar Naik wrote:
> We can also try to mount the particular dir on ramfs and reduce the
> performance degradation
>
> -Sagar
> Billy Pearson wrote:
>> I would like to see something like this also I run 32bit servers so I
>> am limited on how much memory I can use for heap. Besides just storing
>> to disk I would like to see some sort of cache like a block cache that
>> will cache parts the BlocksMap this would help reduce the hits to disk
>> for lookups and still give us the ability to lower the memory
>> requirement for the namenode.
>>
>> Billy
>>
>>
>> "Dennis Kubes" <ku...@apache.org> wrote in message
>> news:492D3D15.7070401@apache.org...
>>> From time to time a message pops up on the mailing list about OOM
>>> errors for the namenode because of too many files. Most recently
>>> there was a 1.7 million file installation that was failing. I know
>>> the simple solution to this is to have a larger java heap for the
>>> namenode. But the non-simple way would be to convert the BlocksMap
>>> for the NameNode to be stored on disk and then queried and updated
>>> for operations. This would eliminate memory problems for large file
>>> installations but also might degrade performance slightly. Questions:
>>>
>>> 1) Is there any current work to allow the namenode to store on disk
>>> versus in memory? This could be a configurable option.
>>>
>>> 2) Besides possible slight degradation in performance, is there a
>>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>>>
>>> I am willing to put forth the work to make this happen. Just want to
>>> make sure I am not going down the wrong path to begin with.
>>>
>>> Dennis
>>>
>>
>>
>
>
Re: Namenode BlocksMap on Disk
Posted by Sagar Naik <sn...@attributor.com>.
We can also try to mount the particular dir on ramfs and reduce the
performance degradation.
-Sagar
Billy Pearson wrote:
> I would like to see something like this also I run 32bit servers so I
> am limited on how much memory I can use for heap. Besides just storing
> to disk I would like to see some sort of cache like a block cache that
> will cache parts the BlocksMap this would help reduce the hits to disk
> for lookups and still give us the ability to lower the memory
> requirement for the namenode.
>
> Billy
>
>
> "Dennis Kubes" <ku...@apache.org> wrote in message
> news:492D3D15.7070401@apache.org...
>> From time to time a message pops up on the mailing list about OOM
>> errors for the namenode because of too many files. Most recently
>> there was a 1.7 million file installation that was failing. I know
>> the simple solution to this is to have a larger java heap for the
>> namenode. But the non-simple way would be to convert the BlocksMap
>> for the NameNode to be stored on disk and then queried and updated
>> for operations. This would eliminate memory problems for large file
>> installations but also might degrade performance slightly. Questions:
>>
>> 1) Is there any current work to allow the namenode to store on disk
>> versus in memory? This could be a configurable option.
>>
>> 2) Besides possible slight degradation in performance, is there a
>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>>
>> I am willing to put forth the work to make this happen. Just want to
>> make sure I am not going down the wrong path to begin with.
>>
>> Dennis
>>
>
>
Re: Namenode BlocksMap on Disk
Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I would like to see something like this also. I run 32-bit servers, so I am
limited in how much memory I can use for heap. Besides just storing to disk,
I would like to see some sort of cache, like a block cache, that will cache
parts of the BlocksMap. This would help reduce the hits to disk for lookups
and still give us the ability to lower the memory requirement for the namenode.
Billy
"Dennis Kubes" <ku...@apache.org> wrote in
message news:492D3D15.7070401@apache.org...
> From time to time a message pops up on the mailing list about OOM errors
> for the namenode because of too many files. Most recently there was a 1.7
> million file installation that was failing. I know the simple solution to
> this is to have a larger java heap for the namenode. But the non-simple
> way would be to convert the BlocksMap for the NameNode to be stored on
> disk and then queried and updated for operations. This would eliminate
> memory problems for large file installations but also might degrade
> performance slightly. Questions:
>
> 1) Is there any current work to allow the namenode to store on disk versus
> in memory? This could be a configurable option.
>
> 2) Besides possible slight degradation in performance, is there a reason
> why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I am willing to put forth the work to make this happen. Just want to make
> sure I am not going down the wrong path to begin with.
>
> Dennis
>
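Billy's block-cache idea maps naturally onto an LRU cache: keep only the hottest BlocksMap entries on heap and fall back to disk on a miss. A minimal sketch, assuming an access-ordered LinkedHashMap and an illustrative capacity (the class and value types are made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a heap-bounded cache over the BlocksMap: least-recently-used
// entries are evicted once the capacity is reached; misses would be
// served from the hypothetical on-disk map.
public class BlockCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BlockCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once the cache is full
    }

    public static void main(String[] args) {
        BlockCache<Long, String> cache = new BlockCache<>(2);
        cache.put(1L, "replicas-of-block-1");
        cache.put(2L, "replicas-of-block-2");
        cache.get(1L);                          // touch 1, so 2 becomes eldest
        cache.put(3L, "replicas-of-block-3");   // evicts block 2
        System.out.println(cache.containsKey(2L)); // prints false
        System.out.println(cache.containsKey(1L)); // prints true
    }
}
```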
Re: Namenode BlocksMap on Disk
Posted by Doug Cutting <cu...@apache.org>.
Billy Pearson wrote:
> We are looking for a way to support smaller clusters also that might
> overrun their heap size, causing the cluster to crash.
Support for namespaces larger than RAM would indeed be a good feature to
have. Implementing this without impacting large-cluster in-memory
namenode performance should be possible, but may or may not be easy.
You are welcome to tackle this task if it is a priority for you.
Doug
Re: Namenode BlocksMap on Disk
Posted by Billy Pearson <sa...@pearsonwholesale.com>.
Doug:
If we use the heap as a cache and you have a large cluster, then you will
have the memory on the NN to handle keeping all the namespace in memory.
We are looking for a way to support smaller clusters as well, which might
overrun their heap size, causing the cluster to crash.
So if the NN has the room to cache all the namespace, then the larger
clusters will not see any disk hits once the namespace is fully loaded into
memory.
Billy
"Doug Cutting" <cu...@apache.org> wrote in
message news:492D90A5.8090707@apache.org...
> Dennis Kubes wrote:
>> 2) Besides possible slight degradation in performance, is there a reason
>> why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I think the assumption is that it would be considerably more than slight
> degradation. I've seen the namenode benchmarked at over 50,000 opens per
> second. If file data is on disk, and the namespace is considerably bigger
> than RAM, then a seek would be required per access. At 10 ms/seek, that
> would give only 100 opens per second, or 500x slower. Flash storage today
> peaks at around 5k seeks/second.
>
> For smaller clusters the namenode might not need to be able to perform 50k
> opens/second, but for larger clusters we do not want the namenode to
> become a bottleneck.
>
> Doug
>
Re: Namenode BlocksMap on Disk
Posted by Doug Cutting <cu...@apache.org>.
Brian Bockelman wrote:
> Do you have any graphs you can share showing 50k opens / second (could
> be publicly or privately)? The more external benchmarking data I have,
> the more I can encourage adoption amongst my university...
The 50k opens/second is from some internal benchmarks run at Y! nearly a
year ago. (It doesn't look like Y! runs that benchmark regularly
anymore, as far as I can tell.) I copied the graph to:
http://people.apache.org/~cutting/nn500.png
Note that all of the operations that modify the namespace top out at
around 5k/second, since these are logged & flushed to disk.
I found some more recent micro namenode benchmarks at:
http://tinyurl.com/6bxoxz
These indicate that actual use doesn't hit these levels, but would
still, on large clusters, be adversely affected by moving to a
disk-based namespace.
Doug
Re: Namenode BlocksMap on Disk
Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Nov 26, 2008, at 12:08 PM, Doug Cutting wrote:
> Dennis Kubes wrote:
>> 2) Besides possible slight degradation in performance, is there a
>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I think the assumption is that it would be considerably more than
> slight degradation. I've seen the namenode benchmarked at over
> 50,000 opens per second. If file data is on disk, and the namespace
> is considerably bigger than RAM, then a seek would be required per
> access. At 10 ms/seek, that would give only 100 opens per second, or
> 500x slower. Flash storage today peaks at around 5k seeks/second.
>
> For smaller clusters the namenode might not need to be able to
> perform 50k opens/second, but for larger clusters we do not want the
> namenode to become a bottleneck.
>
:)
Do you have any graphs you can share showing 50k opens / second (could
be publicly or privately)? The more external benchmarking data I
have, the more I can encourage adoption amongst my university...
Brian
Re: Namenode BlocksMap on Disk
Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> 2) Besides possible slight degradation in performance, is there a reason
> why the BlocksMap shouldn't or couldn't be stored on disk?
I think the assumption is that it would be considerably more than slight
degradation. I've seen the namenode benchmarked at over 50,000 opens
per second. If file data is on disk, and the namespace is considerably
bigger than RAM, then a seek would be required per access. At
10 ms/seek, that would give only 100 opens per second, or 500x slower.
Flash storage today peaks at around 5k seeks/second.
For smaller clusters the namenode might not need to be able to perform
50k opens/second, but for larger clusters we do not want the namenode to
become a bottleneck.
Doug
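Doug's back-of-the-envelope numbers, worked out explicitly: one 10 ms disk seek per open caps throughput at 100 opens per second, 500x below the roughly 50,000 opens per second he saw in-memory.

```java
// Worked version of the estimate in the thread: a seek-bound namenode
// does 1000 ms / 10 ms = 100 opens per second, versus ~50,000 from RAM.
public class SeekMath {
    public static void main(String[] args) {
        long seekMillis = 10;                               // one seek per open
        long diskOpensPerSec = 1000 / seekMillis;           // 100 opens/second
        long memoryOpensPerSec = 50_000;                    // benchmarked in RAM
        long slowdown = memoryOpensPerSec / diskOpensPerSec;
        System.out.println(diskOpensPerSec);                // prints 100
        System.out.println(slowdown);                       // prints 500
    }
}
```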