Posted to common-user@hadoop.apache.org by Rick Hangartner <ha...@strands.com> on 2008/11/21 07:22:49 UTC

Practical limits on number of blocks per datanode.

Hi,

We are in the midst of considering Hadoop as a prototype solution for  
a system we are building.  In the abstract Hadoop and MapReduce are  
very well-suited to our computational problem.  However, this email  
exchange has caused us some concern that we are hoping the user  
community might allay.  We've searched JIRA for relevant issues but  
didn't turn up anything. (We probably aren't as adept as we might be  
at surfacing appropriate items though.)

Here are the relevant numbers for the data we are using to prototype a  
system using Hadoop 0.18.1:

We have 16,000,000 files that are 10K each, or about 160GB total.  We  
have 10 datanodes with the default replication factor of 3.   Each  
file will probably be stored as a single block, right?  This means we  
would be storing 48,000,000 blocks on 10 datanodes or 4,800,000 blocks  
per node.

At 160GB, the total data is not particularly large.  Unfortunately,
the attached email exchange suggests we could have a problem with the
large number of blocks per node.  We have considered combining a
number of small files into larger files (say, concatenating sets of
100 files into single larger files so we have 48,000 blocks that are
1MB in size per node).  This would not significantly affect our
MapReduce algorithm, but it could undesirably complicate other  
components of the system that use this data.
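
To make the idea concrete, here is a rough, untested sketch of how we
might pack batches of small files into a single SequenceFile on HDFS,
keyed by the original file name so individual files stay addressable
(the class name, paths, and arguments are just placeholders):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a list of local files into one SequenceFile on HDFS.
// Usage: SmallFilePacker <hdfs output path> <local file> [<local file> ...]
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (int i = 1; i < args.length; i++) {
        File f = new File(args[i]);
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          // Read the whole small file into memory (they are only ~10K each).
          int off = 0;
          while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
          }
        } finally {
          in.close();
        }
        // Key = original file name, value = raw file contents.
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Because the records keep the original file names as keys, other
components of our system could in principle still look up individual
files inside the packed file, though they would need to be changed to
read through this extra layer.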

Thanks in advance for any insights on the match between Hadoop (0.18.x  
and later) and our particular system requirements.

RDH

Begin forwarded message:

> From: Konstantin Shvachko <sh...@yahoo-inc.com>
> Date: November 17, 2008 6:27:42 PM PST
> To: core-user@hadoop.apache.org
> Subject: Re: The Case of a Long Running Hadoop System
> Reply-To: core-user@hadoop.apache.org
>
> Bagri,
>
> According to the numbers you posted, your cluster has 6,000,000 block
> replicas and only 12 data-nodes. The blocks are small, on average about
> 78KB according to fsck, so each node contains about 40GB worth of block
> data. But the number of blocks is really huge: 500,000 per node. Is my
> math correct? I haven't seen data-nodes that big yet.
> The problem here is that a data-node keeps a map of all its blocks in
> memory. The map is a HashMap; with 500,000 entries you can get long
> lookup times, I guess. Block reports can also take a long time.
>
> So I believe restarting the name-node will not help you.
> You should somehow pack your small files into larger ones.
> Alternatively, you can increase your cluster size, probably 5 to 10
> times larger.
> I don't remember whether we had any optimization patches related to
> the data-node block map since 0.15. Please advise if anybody remembers.
>
> Thanks,
> --Konstantin
>
>
> Abhijit Bagri wrote:
>> We do not have a secondary namenode because 0.15.3 has a serious bug
>> which truncates the namenode image if there is a failure while the
>> namenode fetches the image from the secondary namenode. See HADOOP-3069.
>> I have a patched version of 0.15.3 for this issue. From the patch
>> of HADOOP-3069, the changes are on the namenode _and_ the secondary
>> namenode, which means I just can't fire up a secondary namenode.
>> - Bagri
>> On Nov 15, 2008, at 11:36 PM, Billy Pearson wrote:
>>> If I understand correctly, the secondary namenode merges the edits
>>> log into the fsimage and reduces the edit log size.
>>> That is likely the root of your problems: 8.5G seems large and is
>>> likely putting a strain on your master server's memory and I/O
>>> bandwidth.
>>> Why do you not have a secondary namenode?
>>>
>>> If you do not have the memory on the master, I would look into
>>> stopping a datanode/tasktracker on one server and loading the
>>> secondary namenode on it.
>>>
>>> Let it run for a while and watch your log for the secondary
>>> namenode; you should see your edit log get smaller.
>>>
>>> I am not an expert but that would be my first action.
>>>
>>> Billy
>>>
>>>

Re: Practical limits on number of blocks per datanode.

Posted by Johan Oskarsson <jo...@oskarsson.nu>.
Hi Rick,

Unfortunately, 4,800,000 blocks per node is going to be too much. Ideally
you'd want to merge your files into as few as possible; even 1MB per
file is quite small for Hadoop. Would it be possible to merge them into
hundreds of megabytes or, preferably, gigabyte-sized files?

In newer Hadoop versions there is an archive feature that can put many
files into an archive for you. This can then be processed transparently
by Hadoop. I haven't used it myself, though, so I can't say whether
it's worth the effort.
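
From what I've seen in the docs it's invoked with something like this
(check the exact syntax for your release; the paths below are just
placeholders):

  hadoop archive -archiveName packed.har /user/rick/smallfiles /user/rick/archives

The resulting .har is exposed through the har:// filesystem, so
existing jobs and fs commands can still read the original files inside
it.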

I ran into issues with too many blocks per datanode before, and it's not
fun: the datanodes start losing contact with the namenode, with all
kinds of interesting side effects.

/Johan
