Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2006/03/17 21:14:02 UTC

[jira] Updated: (HADOOP-93) allow minimum split size configurable

     [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

Hairong Kuang updated HADOOP-93:
--------------------------------

    Attachment: hadoop-93.fix

> allow minimum split size configurable
> -------------------------------------
>
>          Key: HADOOP-93
>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>      Project: Hadoop
>         Type: Bug
>     Reporter: Hairong Kuang
>  Attachments: hadoop-93.fix
>
> The current default split size is the size of a block (32M), and a SequenceFile sets it to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on crawled documents. Its input data consists of 356 sequence files, each around 30G in size. The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer can control the number of tasks generated.
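To make the 356*30G/2K figure concrete, here is a minimal arithmetic sketch. It is not Hadoop's actual split code: the class and method names (SplitEstimate, numSplits, effectiveSplitSize) are hypothetical, and it assumes binary units (K = 1024), files of exactly 30G, and that a configured minimum simply acts as a lower bound on the split size.

```java
// Illustrative arithmetic only; none of these names come from Hadoop.
final class SplitEstimate {
    static final long K = 1024L;
    static final long M = K * K;
    static final long G = M * K;

    // Number of splits needed to cover totalBytes (ceiling division).
    static long numSplits(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    // Assumption: the configured minimum is a floor on the split size.
    static long effectiveSplitSize(long formatSplitBytes, long minSplitBytes) {
        return Math.max(formatSplitBytes, minSplitBytes);
    }

    public static void main(String[] args) {
        long files = 356;
        long fileSize = 30 * G;      // "around 30G" per file, taken as exact here
        long syncInterval = 2 * K;   // the 2K figure quoted in the issue

        // Split size fixed at 2K: 356 * 30G / 2K
        long tasksAt2K = files * numSplits(fileSize, syncInterval);
        System.out.println(tasksAt2K);    // 5599395840

        // With a hypothetical configured minimum of one block (32M)
        long split = effectiveSplitSize(syncInterval, 32 * M);
        long tasksAt32M = files * numSplits(fileSize, split);
        System.out.println(tasksAt32M);   // 341760
    }
}
```

Under these assumptions the 2K split size yields roughly 5.6 billion map tasks, while raising the minimum to one 32M block brings the count down to 341,760.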

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira