Posted to common-user@hadoop.apache.org by erolagnab <tr...@gmail.com> on 2007/07/15 17:04:38 UTC

How much RAMs needed...

I have an HDFS with 2 datanodes and 1 namenode on 3 different machines, with
2 GB of RAM each.
Datanode A contains around 700,000 blocks and Datanode B contains 1,200,000+
blocks. The namenode fails to start due to an out-of-memory error when trying
to add Datanode B to its rack. I have adjusted the Java heap to 1600 MB, which
is the maximum, but it still runs out of memory.

AFAIK, the namenode loads all block information into memory. If so, is there
any way to estimate how much RAM is needed for an HDFS with a given number of
blocks on each datanode?
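
For a rough sense of scale, here is a hedged back-of-the-envelope sketch
rather than an official formula: a commonly quoted rule of thumb is that every
file, directory, and block object costs on the order of 150 bytes of namenode
heap, and with tiny files the file count roughly doubles the object count.
Treat the result as a lower bound, since real per-object costs vary with
Hadoop version, replication factor, and path lengths.

// Hedged back-of-the-envelope namenode heap estimate. The 150-byte figure is
// an assumption (a widely quoted rule of thumb), not a measured value.
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long files = 1900000L;          // tiny files, roughly one block each
        long blocks = 1900000L;         // ~700,000 on A + ~1,200,000+ on B
        long bytesPerObject = 150L;     // assumed heap cost per namespace object

        long estimatedBytes = (files + blocks) * bytesPerObject;
        System.out.println("Estimated namenode heap (lower bound): "
                + (estimatedBytes / (1024L * 1024L)) + " MB");
    }
}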


Re: How much RAMs needed...

Posted by Ted Dunning <td...@veoh.com>.
HDFS can't really do the combining into larger files for you, but if you can do
it yourself, it will help quite a bit.

You might need a custom InputFormat or split to make it all sing, but you
should be much better off with fewer large input files.

One of the biggest advantages will be that your disk reading will be much
more linear with much less seeking.
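
For what it's worth, here is a minimal, hedged sketch of one way to do that
combining outside of a MapReduce job: pack many tiny local files into a single
SequenceFile on HDFS, keyed by the original filename, so later jobs read one
large, splittable file instead of hundreds of thousands of small ones. The
paths and class name are made up for illustration.

// Hedged sketch: pack tiny local files into one SequenceFile on HDFS.
// args[0] is assumed to be a local directory full of small files; the output
// path is illustrative.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TinyFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One big container file instead of many tiny ones.
        Path out = new Path("/user/trung/packed/part-00000.seq");
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File f : new File(args[0]).listFiles()) {
                if (!f.isFile()) {
                    continue;   // skip subdirectories
                }
                byte[] buf = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        int n = in.read(buf, off, buf.length - off);
                        if (n < 0) {
                            break;
                        }
                        off += n;
                    }
                } finally {
                    in.close();
                }
                // Key = original filename, value = raw file contents.
                writer.append(new Text(f.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}

A job can then read the packed file with SequenceFileInputFormat instead of a
custom split.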


On 7/15/07 6:26 PM, "Nguyen Kien Trung" <tr...@gmail.com> wrote:

> Thanks Ted,
> 
> Unfortunately, those files really are tiny. Would it be good practice
> for HDFS to combine those tiny files into a single block that fits the
> standard size of 64 MB?
> 
> Ted Dunning wrote:
>> Are these really tiny files, or are you really storing 2M x 100MB = 200TB of
>> data? Or do you have more like 2M x 10KB = 20GB of data?
>> 
>> Map-reduce and HDFS will generally work much better if you can arrange to
>> have relatively larger files.
>> 
>> 
>> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>> 
>>   
>>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>>> with 2 GB of RAM each.
>>> Datanode A contains around 700,000 blocks and Datanode B contains
>>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>>> error when trying to add Datanode B to its rack. I have adjusted the
>>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>>
>>> AFAIK, the namenode loads all block information into memory. If so, is
>>> there any way to estimate how much RAM is needed for an HDFS with a given
>>> number of blocks on each datanode?
>>>     
>> 
>> 
>>   
> 


Re: How much RAMs needed...

Posted by Nguyen Kien Trung <tr...@gmail.com>.
Thanks Ted,

Unfortunately, those files really are tiny. Would it be good practice
for HDFS to combine those tiny files into a single block that fits the
standard size of 64 MB?

Ted Dunning wrote:
> Are these really tiny files, or are you really storing 2M x 100MB = 200TB of
> data? Or do you have more like 2M x 10KB = 20GB of data?
>
> Map-reduce and HDFS will generally work much better if you can arrange to
> have relatively larger files.
>
>
> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>
>   
>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines, with
>> 2 GB of RAM each.
>> Datanode A contains around 700,000 blocks and Datanode B contains 1,200,000+
>> blocks. The namenode fails to start due to an out-of-memory error when trying
>> to add Datanode B to its rack. I have adjusted the Java heap to 1600 MB,
>> which is the maximum, but it still runs out of memory.
>>
>> AFAIK, the namenode loads all block information into memory. If so, is there
>> any way to estimate how much RAM is needed for an HDFS with a given number of
>> blocks on each datanode?
>>     
>
>
>   


Re: How much RAMs needed...

Posted by "Peter W." <pe...@marketingbrokers.com>.
Trung,

Someone more knowledgeable will need to help.

It's my very simple understanding that Hadoop DFS
creates multiple blocks for every file being processed.

The JVM heap being exceeded could possibly be a file
handle issue instead of being due to overall block count.

In other words, your namenode should be able to
start up and handle far more input records if you use
fewer, bigger files on the datanodes.

Best,

Peter W.


On Jul 15, 2007, at 7:07 PM, Nguyen Kien Trung wrote:

> Hi Peter,
>
> I appreciate the info, but I'm afraid I'm not getting what you mean.
> The issue I've encountered is that I'm not able to start up the namenode
> due to an out-of-memory error, given that there is a huge number of
> tiny files on the datanodes.

RE: How much RAMs needed...

Posted by Dhruba Borthakur <dh...@yahoo-inc.com>.
The metadata for each file occupies some portion of main memory. After your
8-hour run, the number of files in the Namenode is nearing the maximum that
2 GB of memory can hold. At that point, the JVM garbage collector runs
frantically to clean up memory and consumes a lot of CPU in the process. You
will see Datanodes and clients timing out because the Namenode is CPU-bound.
Shortly after that you might see OutOfMemory errors on the Namenode.

It would be nice to know how many files you were successfully able to create
before the timeouts started occurring. Also, how many blocks were there in
each file?

Thanks,
Dhruba
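
In case it helps with gathering those numbers, below is a hedged sketch of a
small counter that walks the namespace with the FileSystem API and estimates
blocks from file length and block size. The method names follow the current
API and may be spelled slightly differently in older Hadoop releases; the
fsck tool can report similar totals.

// Hedged sketch: count files and estimate total blocks under a directory.
// API spellings (e.g. isDirectory) are assumptions that vary by Hadoop version.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceCounter {
    private long files = 0;
    private long blocks = 0;

    private void walk(FileSystem fs, Path dir) throws IOException {
        FileStatus[] entries = fs.listStatus(dir);
        if (entries == null) {
            return;
        }
        for (FileStatus status : entries) {
            if (status.isDirectory()) {
                walk(fs, status.getPath());
            } else {
                files++;
                long len = status.getLen();
                long blockSize = status.getBlockSize();
                if (len > 0 && blockSize > 0) {
                    // ceil(len / blockSize): blocks used by this file
                    blocks += (len + blockSize - 1) / blockSize;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        NamespaceCounter counter = new NamespaceCounter();
        counter.walk(fs, new Path(args.length > 0 ? args[0] : "/"));
        System.out.println("files=" + counter.files
                + " estimated blocks=" + counter.blocks);
    }
}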

-----Original Message-----
From: Nguyen Kien Trung [mailto:trung.n.k@gmail.com] 
Sent: Monday, July 16, 2007 8:13 PM
To: hadoop-user@lucene.apache.org
Subject: Re: How much RAMs needed...

Thanks Peter and Ted, your explanations do make some sense to me.

The out of memory error is as follows:
java.lang.OutOfMemoryError
        at sun.misc.Unsafe.allocateMemory(Native Method)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:99)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:57)
        at sun.nio.ch.IOUtil.read(IOUtil.java:205)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:207)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:482)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:350)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:267)
which makes me think the problem is not due to block count.

I decided to give it another try.
I formatted a new namenode and wrote a program with 100 threads
continuously writing individual files into HDFS with 2 datanodes (A and
B) and a replication factor of 2. In the first few hours, blocks were
replicated correctly and evenly across the 2 datanodes (I used the browser
to check the number of blocks on each datanode), but after 8 hours, HDFS
started behaving strangely:
1) Blocks were not equally replicated
2) RPC timeouts (on the datanodes)
3) Out-of-memory errors (on the namenode)

I'm going to look deeper into the Hadoop code. Any other thoughts?
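
For reference, a minimal sketch of that kind of load generator is below. The
file size, paths, and thread handling are illustrative assumptions, not the
original program, and the replication factor is taken from dfs.replication in
the cluster configuration rather than set in code.

// Hedged sketch: N threads continuously creating small files in HDFS, in the
// spirit of the test described above. Runs until killed.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileLoadTest {
    public static void main(String[] args) {
        final Configuration conf = new Configuration();
        final int threads = 100;

        for (int t = 0; t < threads; t++) {
            final int id = t;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        FileSystem fs = FileSystem.get(conf);
                        Random rand = new Random(id);
                        byte[] payload = new byte[10 * 1024];   // ~10 KB per file
                        for (long i = 0; ; i++) {
                            rand.nextBytes(payload);
                            Path p = new Path("/loadtest/t" + id + "/f" + i);
                            FSDataOutputStream out = fs.create(p);
                            try {
                                out.write(payload);
                            } finally {
                                out.close();
                            }
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}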

Ted Dunning wrote:
> Peter is pointing out that he was able to process the equivalent of many
> small files using very modest hardware (smaller than your hardware).
>
> This is confirmation that you need to combine your inputs into larger
> chunks.
>
>
> On 7/15/07 7:07 PM, "Nguyen Kien Trung" <tr...@gmail.com> wrote:
>
>   
>> Hi Peter,
>>
>> I appreciate the info, but I'm afraid I'm not getting what you mean.
>> The issue I've encountered is that I'm not able to start up the namenode
>> due to an out-of-memory error, given that there is a huge number of
>> tiny files on the datanodes.
>>
>> Cheers,
>>
>> Trung
>>
>> Peter W. wrote:
>>     
>>> Trung,
>>>
>>> Using one machine (with 2GB RAM) and 300 input files
>>> I was able to successfully run:
>>>
>>> INFO mapred.JobClient:
>>>
>>> Map input records=10785802
>>> Map output records=10785802
>>> Map input bytes=1302175673
>>> Map output bytes=1101864522
>>> Reduce input groups=1244034
>>> Reduce input records=10785802
>>> Reduce output records=1050704
>>>
>>> Consolidating the files in your input
>>> directories might help.
>>>
>>> Peter W.
>>>
>>>
>>> On Jul 15, 2007, at 5:40 PM, Ted Dunning wrote:
>>>
>>>       
>>>> Are these really tiny files, or are you really storing 2M x 100MB =
>>>> 200TB of
>>>> data? Or do you have more like 2M x 10KB = 20GB of data?
>>>>
>>>> Map-reduce and HDFS will generally work much better if you can
>>>> arrange to
>>>> have relatively larger files.
>>>>
>>>>
>>>> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>>>>
>>>>         
>>>>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>>>>> with 2 GB of RAM each.
>>>>> Datanode A contains around 700,000 blocks and Datanode B contains
>>>>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>>>>> error when trying to add Datanode B to its rack. I have adjusted the
>>>>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>>>>
>>>>> AFAIK, the namenode loads all block information into memory. If so, is
>>>>> there any way to estimate how much RAM is needed for an HDFS with a given
>>>>> number of blocks on each datanode?
>>>>>           
>>>       
>
>
>   



Re: How much RAMs needed...

Posted by Nguyen Kien Trung <tr...@gmail.com>.
Thanks Peter and Ted, your explanations do make some sense to me.

The out of memory error is as follows:
java.lang.OutOfMemoryError
        at sun.misc.Unsafe.allocateMemory(Native Method)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:99)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:57)
        at sun.nio.ch.IOUtil.read(IOUtil.java:205)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:207)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:482)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:350)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:267)
which makes me think the problem is not due to block count.

I decided to give it another try.
I formatted a new namenode and wrote a program with 100 threads
continuously writing individual files into HDFS with 2 datanodes (A and
B) and a replication factor of 2. In the first few hours, blocks were
replicated correctly and evenly across the 2 datanodes (I used the browser
to check the number of blocks on each datanode), but after 8 hours, HDFS
started behaving strangely:
1) Blocks were not equally replicated
2) RPC timeouts (on the datanodes)
3) Out-of-memory errors (on the namenode)

I'm going to look deeper into the Hadoop code. Any other thoughts?

Ted Dunning wrote:
> Peter is pointing out that he was able to process the equivalent of many
> small files using very modest hardware (smaller than your hardware).
>
> This is confirmation that you need to combine your inputs into larger
> chunks.
>
>
> On 7/15/07 7:07 PM, "Nguyen Kien Trung" <tr...@gmail.com> wrote:
>
>   
>> Hi Peter,
>>
>> I appreciate the info, but I'm afraid I'm not getting what you mean.
>> The issue I've encountered is that I'm not able to start up the namenode
>> due to an out-of-memory error, given that there is a huge number of
>> tiny files on the datanodes.
>>
>> Cheers,
>>
>> Trung
>>
>> Peter W. wrote:
>>     
>>> Trung,
>>>
>>> Using one machine (with 2GB RAM) and 300 input files
>>> I was able to successfully run:
>>>
>>> INFO mapred.JobClient:
>>>
>>> Map input records=10785802
>>> Map output records=10785802
>>> Map input bytes=1302175673
>>> Map output bytes=1101864522
>>> Reduce input groups=1244034
>>> Reduce input records=10785802
>>> Reduce output records=1050704
>>>
>>> Consolidating the files in your input
>>> directories might help.
>>>
>>> Peter W.
>>>
>>>
>>> On Jul 15, 2007, at 5:40 PM, Ted Dunning wrote:
>>>
>>>       
>>>> Are these really tiny files, or are you really storing 2M x 100MB =
>>>> 200TB of
>>>> data? Or do you have more like 2M x 10KB = 20GB of data?
>>>>
>>>> Map-reduce and HDFS will generally work much better if you can
>>>> arrange to
>>>> have relatively larger files.
>>>>
>>>>
>>>> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>>>>
>>>>         
>>>>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>>>>> with 2 GB of RAM each.
>>>>> Datanode A contains around 700,000 blocks and Datanode B contains
>>>>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>>>>> error when trying to add Datanode B to its rack. I have adjusted the
>>>>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>>>>
>>>>> AFAIK, the namenode loads all block information into memory. If so, is
>>>>> there any way to estimate how much RAM is needed for an HDFS with a given
>>>>> number of blocks on each datanode?
>>>>>           
>>>       
>
>
>   


Re: How much RAMs needed...

Posted by Ted Dunning <td...@veoh.com>.
Peter is pointing out that he was able to process the equivalent of many
small files using very modest hardware (smaller than your hardware).

This is confirmation that you need to combine your inputs into larger
chunks.


On 7/15/07 7:07 PM, "Nguyen Kien Trung" <tr...@gmail.com> wrote:

> Hi Peter,
> 
> I appreciate the info, but I'm afraid I'm not getting what you mean.
> The issue I've encountered is that I'm not able to start up the namenode
> due to an out-of-memory error, given that there is a huge number of
> tiny files on the datanodes.
> 
> Cheers,
> 
> Trung
> 
> Peter W. wrote:
>> Trung,
>> 
>> Using one machine (with 2GB RAM) and 300 input files
>> I was able to successfully run:
>> 
>> INFO mapred.JobClient:
>> 
>> Map input records=10785802
>> Map output records=10785802
>> Map input bytes=1302175673
>> Map output bytes=1101864522
>> Reduce input groups=1244034
>> Reduce input records=10785802
>> Reduce output records=1050704
>> 
>> Consolidating the files in your input
>> directories might help.
>> 
>> Peter W.
>> 
>> 
>> On Jul 15, 2007, at 5:40 PM, Ted Dunning wrote:
>> 
>>> 
>>> Are these really tiny files, or are you really storing 2M x 100MB =
>>> 200TB of
>>> data? Or do you have more like 2M x 10KB = 20GB of data?
>>> 
>>> Map-reduce and HDFS will generally work much better if you can
>>> arrange to
>>> have relatively larger files.
>>> 
>>> 
>>> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>>> 
>>>> 
>>>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>>>> with 2 GB of RAM each.
>>>> Datanode A contains around 700,000 blocks and Datanode B contains
>>>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>>>> error when trying to add Datanode B to its rack. I have adjusted the
>>>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>>>
>>>> AFAIK, the namenode loads all block information into memory. If so, is
>>>> there any way to estimate how much RAM is needed for an HDFS with a given
>>>> number of blocks on each datanode?
>>> 
>> 
>> 
> 


Re: How much RAMs needed...

Posted by Nguyen Kien Trung <tr...@gmail.com>.
Hi Peter,

I appreciate the info, but I'm afraid I'm not getting what you mean.
The issue I've encountered is that I'm not able to start up the namenode
due to an out-of-memory error, given that there is a huge number of
tiny files on the datanodes.

Cheers,

Trung

Peter W. wrote:
> Trung,
>
> Using one machine (with 2GB RAM) and 300 input files
> I was able to successfully run:
>
> INFO mapred.JobClient:
>
> Map input records=10785802
> Map output records=10785802
> Map input bytes=1302175673
> Map output bytes=1101864522
> Reduce input groups=1244034
> Reduce input records=10785802
> Reduce output records=1050704
>
> Consolidating the files in your input
> directories might help.
>
> Peter W.
>
>
> On Jul 15, 2007, at 5:40 PM, Ted Dunning wrote:
>
>>
>> Are these really tiny files, or are you really storing 2M x 100MB = 
>> 200TB of
>> data? Or do you have more like 2M x 10KB = 20GB of data?
>>
>> Map-reduce and HDFS will generally work much better if you can 
>> arrange to
>> have relatively larger files.
>>
>>
>> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>>
>>>
>>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>>> with 2 GB of RAM each.
>>> Datanode A contains around 700,000 blocks and Datanode B contains
>>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>>> error when trying to add Datanode B to its rack. I have adjusted the
>>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>>
>>> AFAIK, the namenode loads all block information into memory. If so, is
>>> there any way to estimate how much RAM is needed for an HDFS with a given
>>> number of blocks on each datanode?
>>
>
>


Re: How much RAMs needed...

Posted by "Peter W." <pe...@marketingbrokers.com>.
Trung,

Using one machine (with 2GB RAM) and 300 input files
I was able to successfully run:

INFO mapred.JobClient:

Map input records=10785802
Map output records=10785802
Map input bytes=1302175673
Map output bytes=1101864522
Reduce input groups=1244034
Reduce input records=10785802
Reduce output records=1050704

Consolidating the files in your input
directories might help.

Peter W.


On Jul 15, 2007, at 5:40 PM, Ted Dunning wrote:

>
> Are these really tiny files, or are you really storing 2M x 100MB =  
> 200TB of
> data? Or do you have more like 2M x 10KB = 20GB of data?
>
> Map-reduce and HDFS will generally work much better if you can  
> arrange to
> have relatively larger files.
>
>
> On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:
>
>>
>> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines,
>> with 2 GB of RAM each.
>> Datanode A contains around 700,000 blocks and Datanode B contains
>> 1,200,000+ blocks. The namenode fails to start due to an out-of-memory
>> error when trying to add Datanode B to its rack. I have adjusted the
>> Java heap to 1600 MB, which is the maximum, but it still runs out of memory.
>>
>> AFAIK, the namenode loads all block information into memory. If so, is
>> there any way to estimate how much RAM is needed for an HDFS with a given
>> number of blocks on each datanode?
>


Re: How much RAMs needed...

Posted by Ted Dunning <td...@veoh.com>.
Are these really tiny files, or are you really storing 2M x 100MB = 200TB of
data? Or do you have more like 2M x 10KB = 20GB of data?

Map-reduce and HDFS will generally work much better if you can arrange to
have relatively larger files.


On 7/15/07 8:04 AM, "erolagnab" <tr...@gmail.com> wrote:

> 
> I have an HDFS with 2 datanodes and 1 namenode on 3 different machines, with
> 2 GB of RAM each.
> Datanode A contains around 700,000 blocks and Datanode B contains 1,200,000+
> blocks. The namenode fails to start due to an out-of-memory error when trying
> to add Datanode B to its rack. I have adjusted the Java heap to 1600 MB,
> which is the maximum, but it still runs out of memory.
> 
> AFAIK, the namenode loads all block information into memory. If so, is there
> any way to estimate how much RAM is needed for an HDFS with a given number of
> blocks on each datanode?