Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2009/06/11 20:06:16 UTC

Effects of increasing block size / min split size

Hi,

I am trying to understand the effects of increasing the block size or the
minimum split size. If I increase them, each mapper will process more data,
which reduces the number of mappers that are spawned. Since there is
overhead in starting mappers, this seems good.

However, if I increase these values too much, what negative effects will
come up? Put another way, how do I compute the best number of mappers to
start for processing a given amount of data on a cluster?

For calculations, let us assume 100 GB of data and 4 machines (dual core).
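
To put rough numbers on the question, here is a back-of-the-envelope sketch in Python. It assumes a single large splittable input and the old-API FileInputFormat rule split = max(min split, min(goal, block)); it ignores per-file boundaries and the small slop allowed on the last split. The function name and the figure of 8 map slots (one per core) are hypothetical, chosen only for illustration.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def estimate_map_tasks(total_bytes, block_size, min_split_size=1, requested_maps=1):
    """Rough map-task count for one large splittable input.

    Uses split = max(min_split, min(goal, block)), where
    goal = total_bytes / requested_maps. Per-file boundaries and
    last-split slop are ignored, so treat the result as an estimate.
    """
    goal_size = total_bytes // max(requested_maps, 1)
    split_size = max(min_split_size, min(goal_size, block_size))
    return math.ceil(total_bytes / split_size)


if __name__ == "__main__":
    total = 100 * GB
    map_slots = 4 * 2  # hypothetical: 4 dual-core machines, one map slot per core

    for split_mb in (64, 128, 512):
        maps = estimate_map_tasks(total, split_mb * MB)
        waves = math.ceil(maps / map_slots)
        print(f"{split_mb} MB splits: {maps} maps, {waves} waves")
```

Under these assumptions, 64 MB splits give 1600 maps (200 waves on 8 slots), while 512 MB splits give 200 maps (25 waves). Raising the minimum split size above the block size has the same effect on the count as raising the block size, though a split that spans several blocks can no longer be read entirely from one local replica.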

Also, if I set the JVM reuse flag (mapred.job.reuse.jvm.num.tasks) to -1,
will it make a difference?

Thanks,
Tarandeep

Re: Effects of increasing block size / min split size

Posted by Harish Mallipeddi <ha...@gmail.com>.

>
> Owen, what were the values for other parameters for your sort benchmark,
> like- io.sort.* etc. Is this documented somewhere so that I can take a look
> or if you can please paste it here.
>
> thanks,
> Tarandeep


Tarandeep,

Check the following links:

http://developer.yahoo.com/blogs/hadoop/Yahoo2009.pdf
http://people.apache.org/~omalley/tera-2009/

Cheers,

-- 
Harish Mallipeddi
http://blog.poundbang.in

Re: Effects of increasing block size / min split size

Posted by Tarandeep Singh <ta...@gmail.com>.

On Fri, Jun 12, 2009 at 4:59 PM, Owen O'Malley <om...@apache.org> wrote:

> On Jun 11, 2009, at 11:06 AM, Tarandeep Singh wrote:
>
>> I am trying to understand the effects of increasing block size or minimum
>> split size. If I increase them, then a mapper will process more data,
>> effectively reducing the number of mappers that will be spawned. As there is
>> an overhead in starting mappers, so this seems good.
>>
>
> Even more important is that the shuffle depends on the number of maps *
> reduces. For the sort benchmark, we found that it was much more performant
> to have a few very large maps (500MB+)

Owen, what were the values of the other parameters for your sort benchmark,
such as io.sort.*? Is this documented somewhere that I can look at, or could
you please paste them here?

thanks,
Tarandeep

>
> -- Owen
>
>

Re: Effects of increasing block size / min split size

Posted by Owen O'Malley <om...@apache.org>.

On Jun 11, 2009, at 11:06 AM, Tarandeep Singh wrote:

> I am trying to understand the effects of increasing block size or minimum
> split size. If I increase them, then a mapper will process more data,
> effectively reducing the number of mappers that will be spawned. As there is
> an overhead in starting mappers, so this seems good.

Even more important is that the cost of the shuffle depends on the number of
maps * reduces, because every reduce fetches a segment from every map. For
the sort benchmark, we found that it performed much better with a few very
large maps (500MB+).
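
The maps * reduces relationship can be sketched as follows. This assumes map output is about the same size as the input (as in a sort) and is spread evenly across partitions; the 8 reduces and the helper names are hypothetical, not figures from the benchmark.

```python
def shuffle_fetches(num_maps, num_reduces):
    """Each reduce fetches one segment from every map."""
    return num_maps * num_reduces


def avg_segment_mb(total_map_output_mb, num_maps, num_reduces):
    """Average size of one map-output segment, assuming even partitioning."""
    return total_map_output_mb / shuffle_fetches(num_maps, num_reduces)


if __name__ == "__main__":
    total_mb = 100 * 1024  # assumes map output is roughly the 100 GB input
    reduces = 8            # hypothetical reduce count

    for maps in (1600, 200):
        fetches = shuffle_fetches(maps, reduces)
        segment = avg_segment_mb(total_mb, maps, reduces)
        print(f"{maps} maps x {reduces} reduces: {fetches} fetches, "
              f"about {segment:.0f} MB each")
```

With these numbers, going from 1600 maps to 200 maps cuts the fetch count from 12800 to 1600 while moving the same total volume, so each fetch carries 64 MB instead of 8 MB.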

-- Owen

Re: Effects of increasing block size / min split size

Posted by Tarandeep Singh <ta...@gmail.com>.

Thanks Jothi...

-Tarandeep

On Fri, Jun 12, 2009 at 4:35 AM, Jothi Padmanabhan <jo...@yahoo-inc.com> wrote:

> If the number of maps is reduced,  it is possible that the size of
> individual map outputs might increase. A couple of possible issues come to
> mind immediately:
> 1.  Number of spills in the map might be more. This might incur extra cost
> during merging.
> 2. Also, while the reduces might pull in more data per fetch (which is
> good), it might also result in a state where the reducer is not able to
> store the map output in memory but needs to shuffle it to disk.
>
> JVM reuse should help, but if the individual task completion time is very
> high, there might not be any discernible performance gain.
>
> Jothi
>
>
> On 6/11/09 11:36 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to understand the effects of increasing block size or minimum
> > split size. If I increase them, then a mapper will process more data,
> > effectively reducing the number of mappers that will be spawned. As there is
> > an overhead in starting mappers, so this seems good.
> >
> > However, If I increase their values too much, what negative effects will
> > come up? Put in other words, how to compute what is the best number of
> > mappers to start for processing a given size data on a cluster.
> >
> > For calculations, let us assume- 100G of data, 4 machines (dual core).
> >
> > Also if I set the reuse jvm flag to -1, will it make a difference?
> >
> > Thanks,
> > Tarandeep
>
>

Re: Effects of increasing block size / min split size

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.

If the number of maps is reduced, the individual map outputs are likely to
get larger. A couple of possible issues come to mind immediately:
1. The number of spills in each map might increase, which incurs extra cost
during merging.
2. While the reduces might pull in more data per fetch (which is good), a
reducer might no longer be able to hold a map output in memory and would
have to shuffle it to disk instead.

JVM reuse should help, but if the individual task completion time is very
high, JVM startup is only a small fraction of it, so there might not be any
discernible performance gain.
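
Point 1 can be roughed out numerically. The sketch below assumes the sort buffer defaults of io.sort.mb = 100 and io.sort.spill.percent = 0.80, and that map output is about the size of the input split. It ignores the record-metadata share of the buffer, combiners, and compression, so the real spill count may well differ; the function name is hypothetical.

```python
import math


def estimate_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough number of spill files one map task writes.

    Assumes a spill is triggered each time the buffer reaches
    io_sort_mb * spill_percent; buffer space reserved for record
    metadata is not modeled.
    """
    usable_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / usable_mb))


if __name__ == "__main__":
    for split_mb in (64, 128, 512):
        print(f"{split_mb} MB of map output: "
              f"about {estimate_spills(split_mb)} spill(s)")
```

Under these assumptions a 64 MB map fits in a single spill, while a 512 MB map spills about 7 times and then has to merge those files, which is why larger maps usually go together with a larger io.sort.mb.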

Jothi

On 6/11/09 11:36 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> Hi,
> 
> I am trying to understand the effects of increasing block size or minimum
> split size. If I increase them, then a mapper will process more data,
> effectively reducing the number of mappers that will be spawned. As there is
> an overhead in starting mappers, so this seems good.
> 
> However, If I increase their values too much, what negative effects will
> come up? Put in other words, how to compute what is the best number of
> mappers to start for processing a given size data on a cluster.
> 
> For calculations, let us assume- 100G of data, 4 machines (dual core).
> 
> Also if I set the reuse jvm flag to -1, will it make a difference?
> 
> Thanks,
> Tarandeep