Posted to general@hadoop.apache.org by Syed Wasti <md...@hotmail.com> on 2010/07/15 20:40:27 UTC

Data Block Size ?

Hi,

I am new to hadoop and looking for some answers to clear my basic concepts on Hadoop.



Will it matter what the data block size is ? 

It is recommended to have a block size of 64 MB, but if we set the block size to 128 MB instead, will this affect the performance?

Does the size of the map tasks created on each datanode depend in any way on the block size?



Thanks for the support.



Regards

Syed



RE: Data Block Size ?

Posted by Syed Wasti <md...@hotmail.com>.
Thank you Allen.
So, is it fair to assume that with a smaller block size (64 MB) my blocks are distributed across more datanodes, and because they are, my map tasks should also run on more datanodes, and because each map processes less data, it should execute faster using fewer resources?
Does it work this way, or is there an algorithm that decides how the blocks are distributed across the datanodes and where the replica copies should go?

Let's say I have a 640 MB file and a cluster with 5 datanodes, and I configure the block size to be 64 MB. How will this be distributed?
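For what it's worth, the basic arithmetic for that scenario can be sketched as below. This is only the block count; the actual placement of each replica is decided by the NameNode's placement policy, and the replication factor of 3 is the HDFS default, assumed here:

```python
# Sketch of the block arithmetic for a 640 MB file with 64 MB blocks
# on 5 datanodes, assuming the HDFS default replication factor of 3.
import math

file_size_mb = 640
block_size_mb = 64
replication = 3
datanodes = 5

blocks = math.ceil(file_size_mb / block_size_mb)   # number of HDFS blocks
replicas = blocks * replication                    # total block replicas stored
replicas_per_node = replicas / datanodes           # average replicas per datanode

print(blocks, replicas, replicas_per_node)
```

So the file becomes 10 blocks, stored as 30 replicas, averaging about 6 replicas per datanode; which replica lands on which node is up to the NameNode.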

Regards
Syed Wasti

> From: awittenauer@linkedin.com
> To: general@hadoop.apache.org
> Subject: Re: Data Block Size ?
> Date: Thu, 15 Jul 2010 18:49:04 +0000
> 
> 
> On Jul 15, 2010, at 11:40 AM, Syed Wasti wrote:
> 
> > Will it matter what the data block size is ? 
> 
> Yes.
> 
> It is recommended to have a block size of 64 MB, but if we set the block size to 128 MB instead, will this affect the performance?
> 
> Yes.
> 
> FWIW, we run with 128MB.
> 
> Does the size of the map tasks created on each datanode depend in any way on the block size?
> 
> Yes.
> 
> Unless told otherwise, Hadoop will generally use # of maps == # of blocks.  So if you have fewer blocks to process, you'll have fewer maps, each doing more work.  This is not necessarily a bad thing; it all depends upon your workload, size of grid, etc.
> 

Re: Data Block Size ?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jul 15, 2010, at 11:40 AM, Syed Wasti wrote:

> Will it matter what the data block size is ? 

Yes.

> It is recommended to have a block size of 64 MB, but if we set the block size to 128 MB instead, will this affect the performance?

Yes.

FWIW, we run with 128MB.
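(For anyone who wants to try a larger block size: it is set per-cluster in hdfs-site.xml, in bytes; in the 0.20-era releases the property is dfs.block.size. A sketch of a 128 MB setting:)

```xml
<!-- hdfs-site.xml: default block size for newly created files, in bytes -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
```

Note this only affects files written after the change; existing files keep the block size they were written with.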

> Does the size of the map tasks created on each datanode depend in any way on the block size?

Yes.

Unless told otherwise, Hadoop will generally use # of maps == # of blocks.  So if you have fewer blocks to process, you'll have fewer maps, each doing more work.  This is not necessarily a bad thing; it all depends upon your workload, size of grid, etc.
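The maps == blocks rule of thumb can be sketched numerically (hypothetical input size; assumes the default behaviour of one input split per block):

```python
# One map task per HDFS block (default split behaviour).
# The 1280 MB input size is a hypothetical figure for illustration.
import math

file_size_mb = 1280

maps_64 = math.ceil(file_size_mb / 64)    # maps with a 64 MB block size
maps_128 = math.ceil(file_size_mb / 128)  # maps with a 128 MB block size

print(maps_64, maps_128)
```

With 64 MB blocks the same input yields 20 maps; with 128 MB blocks it yields 10 maps, each reading twice as much data.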