Posted to common-user@hadoop.apache.org by Dennis Kubes <ku...@apache.org> on 2008/11/26 13:12:05 UTC

Namenode BlocksMap on Disk

 From time to time a message pops up on the mailing list about OOM 
errors for the namenode because of too many files.  Most recently there 
was a 1.7 million file installation that was failing.  I know the simple 
solution to this is to have a larger java heap for the namenode.  But 
the non-simple way would be to convert the BlocksMap for the NameNode to 
be stored on disk and then queried and updated for operations.  This 
would eliminate memory problems for large file installations but also 
might degrade performance slightly.  Questions:

1) Is there any current work to allow the namenode to store on disk 
versus in memory?  This could be a configurable option.

2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?

I am willing to put forth the work to make this happen.  Just want to 
make sure I am not going down the wrong path to begin with.
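
To make the idea concrete, here is roughly the kind of lookup interface I am 
picturing the NameNode calling instead of the in-memory map. This is purely a 
sketch for discussion; the names are invented and nothing here comes from the 
current code:

    import java.io.IOException;

    /**
     * Illustrative only: a pluggable block lookup that could be backed by
     * disk instead of the heap.  Names are invented for discussion.
     */
    public interface BlockStore {
      /** Returns the stored length for a block id, or -1 if unknown. */
      long getNumBytes(long blockId) throws IOException;

      /** Inserts or updates the record for a block. */
      void put(long blockId, long numBytes, long genStamp) throws IOException;

      /** Removes the record for a block when it is deleted. */
      void remove(long blockId) throws IOException;
    }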

Dennis

Re: Namenode BlocksMap on Disk

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Dennis Kubes wrote:
> 
>  From time to time a message pops up on the mailing list about OOM 
> errors for the namenode because of too many files.  Most recently there 
> was a 1.7 million file installation that was failing.  I know the simple 
> solution to this is to have a larger java heap for the namenode.  But 
> the non-simple way would be to convert the BlocksMap for the NameNode to 
> be stored on disk and then queried and updated for operations.  This 
> would eliminate memory problems for large file installations but also 
> might degrade performance slightly.  Questions:
> 
> 1) Is there any current work to allow the namenode to store on disk 
> versus in memory?  This could be a configurable option.
> 
> 2) Besides possible slight degradation in performance, is there a reason 
> why the BlocksMap shouldn't or couldn't be stored on disk?

As Doug mentioned, the main worry is that this would drastically reduce 
performance. Part of the reason is that a large chunk of the work on the 
NameNode happens under a single global lock, so if there is a seek under 
this lock, it affects everything else.

One good long-term fix for this is to make it easy to split the 
namespace between multiple namenodes. There was some work done on 
supporting "volumes". Also, the fact that HDFS now supports symbolic 
links might make it easier for someone adventurous to use them as a 
quick hack to get around this.

If you have a rough prototype implementation, I am sure there will be a 
lot of interest in evaluating it. If Java has any disk-based or 
memory-mapped data structures, that might be the quickest way to try 
out the effect.
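
For example, about the simplest thing to measure with would be a fixed-record 
hash table kept in a memory-mapped file. A rough sketch, assuming 24-byte 
records and a table small enough to fit in a single mapping (the class and 
method names below are invented, nothing from HDFS):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    /**
     * Toy disk-backed block map for benchmarking only: an open-addressing
     * hash table of fixed 24-byte records (blockId, numBytes, genStamp)
     * kept in a single memory-mapped file.  A slot whose blockId is 0 is
     * treated as empty, and the table is assumed never to fill up.
     */
    public class MappedBlockMap {
      private static final int RECORD = 24;     // bytes per record
      private final int slots;                  // table capacity
      private final MappedByteBuffer buf;

      public MappedBlockMap(String path, int slots) throws IOException {
        this.slots = slots;
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        file.setLength((long) slots * RECORD);
        this.buf = file.getChannel().map(FileChannel.MapMode.READ_WRITE,
                                         0, (long) slots * RECORD);
      }

      /** Inserts or updates the record for a block. */
      public void put(long blockId, long numBytes, long genStamp) {
        int pos = findSlot(blockId);
        buf.putLong(pos, blockId);
        buf.putLong(pos + 8, numBytes);
        buf.putLong(pos + 16, genStamp);
      }

      /** Returns the stored length for the block, or -1 if it is absent. */
      public long getNumBytes(long blockId) {
        int pos = findSlot(blockId);
        return buf.getLong(pos) == blockId ? buf.getLong(pos + 8) : -1;
      }

      /** Linear probing: stop at the block's slot or the first empty one. */
      private int findSlot(long blockId) {
        int i = (int) ((blockId & Long.MAX_VALUE) % slots);
        while (true) {
          int pos = i * RECORD;
          long id = buf.getLong(pos);
          if (id == 0 || id == blockId) {
            return pos;
          }
          i = (i + 1) % slots;
        }
      }
    }

Whether the OS page cache keeps enough of this hot while the namenode holds 
its global lock is exactly what a benchmark against such a prototype would show.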

Raghu.

Re: Namenode BlocksMap on Disk

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
 > We can also try to mount the particular dir on ramfs

This could be an interesting project: replacing the whole name-node,
that is, its FSDirectory part, with ramfs.
If there are people interested in that kind of experiment, we can discuss it.

 >>> 2) Besides possible slight degradation in performance, is there a
 >>> reason why the BlocksMap shouldn't or couldn't be stored on disk?

The degradation is expected to be substantial. But another reason is that it
would introduce a whole new layer in the name-node implementation, responsible
for keeping the memory image in sync with the on-disk block map, with caching, etc.

As Raghu mentioned, there was an idea of separating the blockMap into a standalone
server, but currently there is no ongoing progress in this direction.

--Konstantin

Sagar Naik wrote:
> We can also try to mount the particular dir on ramfs and reduce the 
> performance degradation
> 
> -Sagar
> Billy Pearson wrote:
>> I would like to see something like this also. I run 32-bit servers, so I 
>> am limited in how much memory I can use for heap. Besides just storing 
>> to disk, I would like to see some sort of cache, like a block cache, that 
>> will cache parts of the BlocksMap. This would help reduce the hits to disk 
>> for lookups and still give us the ability to lower the memory 
>> requirement for the namenode.
>>
>> Billy
>>
>>
>> "Dennis Kubes" <ku...@apache.org> wrote in message 
>> news:492D3D15.7070401@apache.org...
>>> From time to time a message pops up on the mailing list about OOM 
>>> errors for the namenode because of too many files.  Most recently 
>>> there was a 1.7 million file installation that was failing.  I know 
>>> the simple solution to this is to have a larger java heap for the 
>>> namenode.  But the non-simple way would be to convert the BlocksMap 
>>> for the NameNode to be stored on disk and then queried and updated 
>>> for operations.  This would eliminate memory problems for large file 
>>> installations but also might degrade performance slightly.  Questions:
>>>
>>> 1) Is there any current work to allow the namenode to store on disk 
>>> versus in memory?  This could be a configurable option.
>>>
>>> 2) Besides possible slight degradation in performance, is there a 
>>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>>>
>>> I am willing to put forth the work to make this happen.  Just want to 
>>> make sure I am not going down the wrong path to begin with.
>>>
>>> Dennis
>>>
>>
>>
> 
> 

Re: Namenode BlocksMap on Disk

Posted by Sagar Naik <sn...@attributor.com>.
We can also try to mount the particular dir on ramfs and reduce the 
performance degradation.

-Sagar
Billy Pearson wrote:
> I would like to see something like this also. I run 32-bit servers, so I 
> am limited in how much memory I can use for heap. Besides just storing 
> to disk, I would like to see some sort of cache, like a block cache, that 
> will cache parts of the BlocksMap. This would help reduce the hits to disk 
> for lookups and still give us the ability to lower the memory 
> requirement for the namenode.
>
> Billy
>
>
> "Dennis Kubes" <ku...@apache.org> wrote in message 
> news:492D3D15.7070401@apache.org...
>> From time to time a message pops up on the mailing list about OOM 
>> errors for the namenode because of too many files.  Most recently 
>> there was a 1.7 million file installation that was failing.  I know 
>> the simple solution to this is to have a larger java heap for the 
>> namenode.  But the non-simple way would be to convert the BlocksMap 
>> for the NameNode to be stored on disk and then queried and updated 
>> for operations.  This would eliminate memory problems for large file 
>> installations but also might degrade performance slightly.  Questions:
>>
>> 1) Is there any current work to allow the namenode to store on disk 
>> versus in memory?  This could be a configurable option.
>>
>> 2) Besides possible slight degradation in performance, is there a 
>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>>
>> I am willing to put forth the work to make this happen.  Just want to 
>> make sure I am not going down the wrong path to begin with.
>>
>> Dennis
>>
>
>


Re: Namenode BlocksMap on Disk

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I would like to see something like this also. I run 32-bit servers, so I am 
limited in how much memory I can use for heap. Besides just storing to disk, 
I would like to see some sort of cache, like a block cache, that will cache 
parts of the BlocksMap. This would help reduce the hits to disk for lookups 
and still give us the ability to lower the memory requirement for the namenode.
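
Even a small LRU cache in front of the on-disk map would show whether that 
helps. A minimal sketch of what I mean (purely illustrative, not tied to any 
existing namenode code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Minimal LRU cache that could sit in front of an on-disk BlocksMap:
     * hot block records stay on the heap, anything else falls back to a
     * disk read.  Keyed by block id; V is whatever record type is stored.
     */
    public class BlockRecordCache<V> extends LinkedHashMap<Long, V> {
      private final int maxEntries;

      public BlockRecordCache(int maxEntries) {
        super(16, 0.75f, true);           // access-order, so iteration is LRU
        this.maxEntries = maxEntries;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
        return size() > maxEntries;       // drop the least recently used entry
      }
    }

A lookup would check the cache first, read from disk on a miss, and put the 
result back so repeated opens of the same files stay in memory.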

Billy


"Dennis Kubes" <ku...@apache.org> wrote in 
message news:492D3D15.7070401@apache.org...
> From time to time a message pops up on the mailing list about OOM errors 
> for the namenode because of too many files.  Most recently there was a 1.7 
> million file installation that was failing.  I know the simple solution to 
> this is to have a larger java heap for the namenode.  But the non-simple 
> way would be to convert the BlocksMap for the NameNode to be stored on 
> disk and then queried and updated for operations.  This would eliminate 
> memory problems for large file installations but also might degrade 
> performance slightly.  Questions:
>
> 1) Is there any current work to allow the namenode to store on disk versus 
> in memory?  This could be a configurable option.
>
> 2) Besides possible slight degradation in performance, is there a reason 
> why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I am willing to put forth the work to make this happen.  Just want to make 
> sure I am not going down the wrong path to begin with.
>
> Dennis
> 



Re: Namenode BlocksMap on Disk

Posted by Doug Cutting <cu...@apache.org>.
Billy Pearson wrote:
> We are also looking for a way to support smaller clusters that might 
> overrun their heap size, causing the cluster to crash.

Support for namespaces larger than RAM would indeed be a good feature to 
have.  Implementing this without impacting in-memory namenode performance 
on large clusters should be possible, but may or may not be easy. 
You are welcome to tackle this task if it is a priority for you.

Doug

Re: Namenode BlocksMap on Disk

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
Doug:
If we use the heap as a cache and you have a large cluster, then you will 
have the memory on the NN to keep all the namespace in memory.
We are also looking for a way to support smaller clusters that might 
overrun their heap size, causing the cluster to crash.
So if the NN has the room to cache all the namespace, then the larger 
clusters will not see any disk hits once the namespace is fully loaded into 
memory.

Billy



"Doug Cutting" <cu...@apache.org> wrote in 
message news:492D90A5.8090707@apache.org...
> Dennis Kubes wrote:
>> 2) Besides possible slight degradation in performance, is there a reason 
>> why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I think the assumption is that it would be considerably more than slight 
> degradation.  I've seen the namenode benchmarked at over 50,000 opens per 
> second.  If file data is on disk, and the namespace is considerably bigger 
> than RAM, then a seek would be required per access.  At 10 ms/seek, that 
> would give only 100 opens per second, or 500x slower. Flash storage today 
> peaks at around 5k seeks/second.
>
> For smaller clusters the namenode might not need to be able to perform 50k 
> opens/second, but for larger clusters we do not want the namenode to 
> become a bottleneck.
>
> Doug
> 



Re: Namenode BlocksMap on Disk

Posted by Doug Cutting <cu...@apache.org>.
Brian Bockelman wrote:
> Do you have any graphs you can share showing 50k opens/second (either 
> publicly or privately)?  The more external benchmarking data I have, 
> the more I can encourage adoption at my university...

The 50k opens/second is from some internal benchmarks run at Y! nearly a 
year ago.  (It doesn't look like Y! runs that benchmark regularly 
anymore, as far as I can tell.)  I copied the graph to:

http://people.apache.org/~cutting/nn500.png

Note that all of the operations that modify the namespace top out at 
around 5k/second, since these are logged & flushed to disk.

I found some more recent micro namenode benchmarks at:

http://tinyurl.com/6bxoxz

These indicate that actual use doesn't hit these levels, but would 
still, on large clusters, be adversely affected by moving to a 
disk-based namespace.

Doug


Re: Namenode BlocksMap on Disk

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Nov 26, 2008, at 12:08 PM, Doug Cutting wrote:

> Dennis Kubes wrote:
>> 2) Besides possible slight degradation in performance, is there a  
>> reason why the BlocksMap shouldn't or couldn't be stored on disk?
>
> I think the assumption is that it would be considerably more than  
> slight degradation.  I've seen the namenode benchmarked at over  
> 50,000 opens per second.  If file data is on disk, and the namespace  
> is considerably bigger than RAM, then a seek would be required per  
> access.  At 10 ms/seek, that would give only 100 opens per second, or  
> 500x slower. Flash storage today peaks at around 5k seeks/second.
>
> For smaller clusters the namenode might not need to be able to  
> perform 50k opens/second, but for larger clusters we do not want the  
> namenode to become a bottleneck.
>

:)

Do you have any graphs you can share showing 50k opens/second (either 
publicly or privately)?  The more external benchmarking data I have, 
the more I can encourage adoption at my university...

Brian


Re: Namenode BlocksMap on Disk

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> 2) Besides possible slight degradation in performance, is there a reason 
> why the BlocksMap shouldn't or couldn't be stored on disk?

I think the assumption is that it would be considerably more than slight 
degradation.  I've seen the namenode benchmarked at over 50,000 opens 
per second.  If file data is on disk, and the namespace is considerably 
bigger than RAM, then a seek would be required per access.  At 
10 ms/seek, that would give only 100 opens per second, or 500x slower. 
Flash storage today peaks at around 5k seeks/second.
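
Spelled out, the back-of-the-envelope numbers are:

    in memory:  ~50,000 opens/sec (benchmarked)
    on disk:    1 seek/open at ~10 ms/seek  ->  ~100 opens/sec, about 500x slower
    on flash:   ~5,000 seeks/sec            ->  still ~10x below the in-memory rate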

For smaller clusters the namenode might not need to be able to perform 
50k opens/second, but for larger clusters we do not want the namenode to 
become a bottleneck.

Doug