Posted to common-dev@hadoop.apache.org by Eric Baldeschwieler <er...@yahoo-inc.com> on 2006/03/21 06:21:33 UTC

a comment on HADOOP-93 -- allow minimum split size configurable

It doesn't seem like we are going to get exactly what we want by
simply going with bigger splits.  What we really want is to read all
of the information locally where possible, and we lose control of
that by increasing the split beyond the block size.  This matters
because we'll have more jobs like this in the future: the simple
aggregations and samples that keep coming up in user requests will
all look like this.
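
For concreteness, the mechanism on the table looks roughly like this
(a sketch, not the actual patch; the property name here is invented
for illustration):

    // Sketch only -- not the real InputFormat code.  The property
    // name "mapred.min.split.size" is an assumption, not necessarily
    // the name used in the attached patch.
    static long chooseSplitSize(JobConf job, long formatDefault) {
      // formatDefault is the block size (32M) for plain files, or
      // SequenceFile.SYNC_INTERVAL (2K) for sequence files.
      long userMin = job.getLong("mapred.min.split.size", 1);
      // The job can raise the floor to cut the number of map tasks.
      return Math.max(userMin, formatDefault);
    }

Nothing stops userMin from exceeding the block size, and once a split
spans several blocks, DFS has placed those blocks independently, so
at most one of them is guaranteed to sit on the task's node.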

It seems to me there are two ways to deal with this:

1) Make jobs with ~300k map tasks work efficiently.  How possible /
impossible is this?

2) Make jobs which consume a set of blocks that are all local to a
node (sketched below).  This seems possible, but will require a fair
rethink of our APIs / abstractions.
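
To make (2) concrete, I have something like the sketch below in mind.
It is not an API proposal; Block and getReplicaHosts() are
placeholder names:

    // Bucket a file's blocks by a host that holds a replica, then
    // hand each map task one bucket, so every byte it reads is
    // node-local.
    Map<String, List<Block>> byHost = new HashMap<String, List<Block>>();
    for (Block b : file.getBlocks()) {
      String host = b.getReplicaHosts()[0];    // pick one replica
      List<Block> bucket = byHost.get(host);
      if (bucket == null) {
        bucket = new ArrayList<Block>();
        byHost.put(host, bucket);
      }
      bucket.add(b);
    }
    // Emit one split per host; each split's blocks are all local.

The rethink is in the abstractions: today a split is a contiguous
byte range of one file, while this turns it into a bag of blocks.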

---

Which way should we push things?  Giving up node / switch locality on
reads seems like the wrong decision, and that is exactly where we
just went by allowing split sizes larger than the block.

PS Increasing the block size to 64M or 128M would clearly help some,
but it does not address the overall issue.  Though 256M might prove
an interesting size...
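
(Back of the envelope, using the numbers from the issue below: at the
2K SequenceFile default, 356 files x 30G/2K is on the order of five
billion map tasks.  At 32M blocks it is about 960 block-sized splits
per file, i.e. roughly 340k tasks -- the "300k" figure above.  At
256M that drops to about 120 splits per file, roughly 43k tasks:
better, but still a very large job.)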

On Mar 17, 2006, at 3:35 PM, Owen O'Malley (JIRA) wrote:

>      [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]
>
> Owen O'Malley updated HADOOP-93:
> --------------------------------
>
>     Attachment:     (was: hadoop_87.fix)
>
>> allow minimum split size configurable
>> -------------------------------------
>>
>>          Key: HADOOP-93
>>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>>      Project: Hadoop
>>         Type: Bug
>>   Components: mapred
>>     Versions: 0.1
>>     Reporter: Hairong Kuang
>>      Fix For: 0.1
>>  Attachments: hadoop-93.fix
>>
>> The current default split size is the size of a block (32M), and a
>> SequenceFile sets it to SequenceFile.SYNC_INTERVAL (2K).  We
>> currently have a Map/Reduce application working on crawled
>> documents.  Its input data consists of 356 sequence files, each of
>> which is around 30G.  A jobtracker takes forever to launch the job
>> because it needs to generate 356*30G/2K map tasks!
>> The proposed solution is to make the minimum split size
>> configurable so that the programmer can control the number of
>> tasks generated.