Posted to mapreduce-dev@hadoop.apache.org by sandeep paul <pa...@gmail.com> on 2014/09/11 11:15:24 UTC

HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size

Hi,

I need confirmation regarding these two parameters and how they affect
performance.

I have read that mapred.max.split.size should always be less than
dfs.block.size, but we always have the option of specifying
mapred.max.split.size greater than dfs.block.size.
What will happen in that case? Will FileInputFormat allow it when
calculating splits, or will it take dfs.block.size as the split size?
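
For context, here is roughly how I am setting the two values (a minimal
sketch; the 128 MB block size, the 256 MB max split size and the
"split-size-test" job name are just example values I made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeTest {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-test");
            // dfs.block.size on the cluster is 128 MB (example value);
            // here the max split size is deliberately set above it.
            // "mapred.max.split.size" is the old property name, equivalent
            // to "mapreduce.input.fileinputformat.split.maxsize" in Hadoop 2.
            job.getConfiguration().setLong("mapred.max.split.size",
                    256L * 1024 * 1024);
            // or, via the helper on the new-API FileInputFormat:
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }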

Say the framework allows it; in that case one map task will end up
processing more than one block (which will not always be on the local
machine). What is the performance impact in that case?

It would be a great help if anyone could clear up this confusion.

Thanks
sandeep

Re: HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size

Posted by Vinayakumar B <vi...@apache.org>.
Hi Sandeep,

AFAIK,

1. "dfs.block.size" and "mapred.max.split.size" are related logically to
get the best performance in case of reading big files and to get the data
locality.

2. There is no strict rule in the framework for the max split size . You
can specify more than block size.

3. If the split size is more than the block size, then single map needs to
read multiple blocks. This block might be in some other node, which will
increase the I/O duration.

4. As I said before, you will loose the data locality gain, in case of
reading from multiple blocks which are located in different nodes.
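
For reference, the final split size in FileInputFormat comes out of a
max/min calculation roughly like the following (a simplified sketch of the
logic, not the exact source):

    // Simplified sketch of the split-size calculation in FileInputFormat:
    //   minSize comes from mapred.min.split.size (default 1),
    //   maxSize comes from mapred.max.split.size (default Long.MAX_VALUE).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Example: blockSize = 128 MB, maxSize = 256 MB, minSize = 1
    //   => min(256 MB, 128 MB) = 128 MB; max(1, 128 MB) = 128 MB
    // A max split size above the block size is clamped back to the block
    // size; to actually get splits spanning multiple blocks, you raise
    // mapred.min.split.size above the block size.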

Regards,
Vinay
