Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/02/14 20:26:43 UTC

[jira] Created: (HADOOP-38) default splitter should incorporate fs block size

default splitter should incorporate fs block size
-------------------------------------------------

         Key: HADOOP-38
         URL: http://issues.apache.org/jira/browse/HADOOP-38
     Project: Hadoop
        Type: Improvement
  Components: mapred  
    Reporter: Doug Cutting


By default, the file splitting code should operate as follows.

  inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
  output is <file,start,length>*

  totalSize = sum of all file sizes;

  desiredSplitSize = totalSize / numMapTasks;
  if (desiredSplitSize > fsBlockSize)             /* new */
    desiredSplitSize = fsBlockSize;
  if (desiredSplitSize < minSplitSize)
    desiredSplitSize = minSplitSize;

  chop input files into desiredSplitSize chunks & return them

In other words, numMapTasks is a desired minimum.  We'll try to chop input into at least numMapTasks chunks, each ideally a single fs block.

If there's not enough input data to create numMapTasks tasks, each with an entire block, then we'll permit tasks whose input is smaller than a filesystem block, down to a minimum split size.
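
For concreteness, here is a minimal Java sketch of the rule above,
including the part marked 'new'.  The class and method names are
illustrative only, not the actual Hadoop API:

  // Self-contained sketch of the default split-size rule described above.
  // Names are hypothetical; this is not the real splitter code.
  public class SplitSizeSketch {
    static long desiredSplitSize(long totalSize, int numMapTasks,
                                 long minSplitSize, long fsBlockSize) {
      long size = totalSize / numMapTasks;  // aim for ~numMapTasks splits
      if (size > fsBlockSize)               // 'new': cap at one fs block so
        size = fsBlockSize;                 // a task's input can be local
      if (size < minSplitSize)              // floor: avoid tiny tasks
        size = minSplitSize;
      return size;
    }

    public static void main(String[] args) {
      long MB = 1L << 20, GB = 1L << 30;
      // Plenty of input: 100GB / 10 tasks = 10GB, capped at the 32MB block.
      System.out.println(desiredSplitSize(100 * GB, 10, 1 * MB, 32 * MB));
      // Scarce input: 5MB / 10 tasks = 0.5MB, raised to the 1MB minimum.
      System.out.println(desiredSplitSize(5 * MB, 10, 1 * MB, 32 * MB));
    }
  }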

This handles cases where:
  - each input record takes a lot of time to process.  In this case we want to make sure we use all of the cluster.  Thus it is important to permit splits smaller than the fs block size.
  - input i/o dominates.  In this case we want to permit the placement of tasks on hosts where their data is local.  This is only possible if splits are fs block size or smaller.

Are there other common cases that this algorithm does not handle well?

The part marked 'new' above is not currently implemented, but I'd like to add it.

Does this sound reasonable?




[jira] Closed: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-38?page=all ]
     
Doug Cutting closed HADOOP-38:
------------------------------




[jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366390 ] 

eric baldeschwieler commented on HADOOP-38:
-------------------------------------------

1GB blocks have a lot of issues.  Until your networks get faster and RAM gets bigger, this is probably too big.  For many of our current tasks, 1GB is also too much input for reasonable restartability.  I think 32M to 128M is a lot closer to the current sweet spot.




[jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366421 ] 

eric baldeschwieler commented on HADOOP-38:
-------------------------------------------

some thoughts:

1) Eventually you are going to want raid / erasure-coding style things.  The simplest way to do this without breaking reads is to batch several blocks, keeping them linear, and then generate parity once all blocks are full.  This gets more expensive as block size increases.  At current sizes, this can all be buffered in RAM in some cases; 1GB blocks rule that out (see the sketch after this list).

2) Currently you can trivially keep a block in RAM for a MAP task.  Depending on the scaling factor, you can probably keep the output in RAM for sorting, reduction, etc., too.  This is nice.  As block size increases you lose this property.

3) When you lose a node, the finer-grained the lost data, the fewer hotspots you have in the system.  Today in a large cluster you can easily have choke points with ~33 Mbit/s aggregate all-to-all.  We've seen larger data sizes slow recovery enough to become a real problem.  1GB blocks take 10x as long to transmit, and this turns into minutes, which will require more sophisticated management (see the numbers after this list).
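
To make item 1 concrete (purely a hypothetical sketch; no such code
exists in Hadoop), a single XOR parity block over a batch of equal-sized
blocks lets any one lost block be rebuilt from the survivors, but only
if the whole batch can be buffered while it fills:

  // Hypothetical illustration in Java: one XOR parity block per batch.
  // XOR-ing the parity with the surviving blocks reconstructs a lost one.
  // Buffering an entire batch in RAM is what 1GB blocks would rule out.
  static byte[] parity(byte[][] batch) {
    byte[] p = new byte[batch[0].length];
    for (byte[] block : batch)
      for (int i = 0; i < p.length; i++)
        p[i] ^= block[i];
    return p;
  }

And back-of-envelope numbers for item 3, using the ~33 Mbit/s choke-point
figure above (my arithmetic, not measured data):

  // Time to push one lost block through a ~33 Mbit/s choke point.
  static double sendSeconds(long blockBytes) {
    return blockBytes * 8.0 / 33e6;
  }
  // sendSeconds(128L << 20) ~ 33 seconds
  // sendSeconds(1L << 30)   ~ 260 seconds, over four minutes per block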

---

None of these are showstoppers, but one of the main reasons we are interested in Hadoop is getting off of our current very-large-chunk storage system, so I'd hate to see the default move quickly to something as large as 1GB.

I can see the advantages of pushing the block size up to manage task tracker RAM use, but I doubt that alone will prove a compelling reason for us to change our default block size.  On the other hand, I also don't think we'll be pumping a petabyte through a single m/r in the near term, so we can assume the zero-code solution (changing the block size) until we have more data to support some other approach.

Of course at 1M tasks, you will want to be careful about linear scans anyway...

I've no concern with the proposal in this bug.  We can probably take this discussion elsewhere.



Re: [jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by Doug Cutting <cu...@apache.org>.
Eric Baldeschwieler wrote:
> You may simply want to specify the input size per job (maybe in  
> blocks?) and let the framework sort things out.

You can achieve that in my proposal by increasing the minSplitSize to 
something larger than the block size.  So that's already possible.  All 
that I'm suggesting is that the default is to try to make things one 
block per split, unless that results in too few splits.
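
Plugging that into the hypothetical desiredSplitSize() sketch from the
issue description shows the floor taking over (illustrative numbers):

  // With minSplitSize above the fs block size, the floor dominates and
  // every split is minSplitSize bytes, regardless of the 32MB block size.
  long MB = 1L << 20, GB = 1L << 30;
  long split = desiredSplitSize(10000 * GB, 1000000, 1 * GB, 32 * MB);
  // totalSize / numMapTasks is ~10MB; the 32MB cap doesn't bite, but the
  // 1GB floor does, so split = 1GB and we get ~10k splits, not a million.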

> A possible optimization would be to read discontinuous blocks into  one 
> map job if you want to pump several blocks worth of data into  each 
> job.  Given the map/reduce mechanism, this should work, yes?

I think you mean multiple blocks per task.  That has potential 
restartability issues, since it's a lot like bigger blocks.  And the 
tasktracker still has to have some representation of every block in 
memory, so I'm not sure it makes the data structure much smaller, which 
is my primary concern with large numbers of tasks.

A reason to usually keep the number of map tasks much greater than the 
number of CPUs is to reduce the impact of restarting map tasks, since 
most user computation is done while mapping.

A reason to usually keep the number of reduce tasks only slightly 
greater than the number of CPUs is to use all resources while not 
generating too many output files, since these might be combined with the 
output of other maps to form inputs, and we'd rather have fewer large 
inputs to split up than many smaller ones.

Doug

Re: [jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
You may simply want to specify the input size per job (maybe in blocks?) and let the framework sort things out.

A possible optimization would be to read discontinuous blocks into one map job if you want to pump several blocks' worth of data into each job.  Given the map/reduce mechanism, this should work, yes?



[jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366381 ] 

Bryan Pendleton commented on HADOOP-38:
---------------------------------------

The idea sounds sound, but is blocksize the best unit? There's a certain overhead for each additional task added to a job; for jobs with really large input, this could cause really large task lists. Is there going to be any code for pre-replicating blocks? Maybe sequences, so there'd be a natural "first choice" node for chunkings larger than one block? Obviously, as datanodes come and go this might not always work ideally, but it could help in the 80% case.



[jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366401 ] 

Doug Cutting commented on HADOOP-38:
------------------------------------

I'm not sure what RAM size or network speeds have to do with it: we stream blocks into a task; we don't read them all at once.

Restartability could be an issue.  If you have a petabyte of input, and you want restartability at 100MB chunks, then you need to be able to support up to 10M tasks per job.  This is possible, but it means the job tracker has to be even more careful not to store too much in RAM per task, not to iterate over all tasks, etc.

But I'm not convinced that 1GB would cause problems for restartability.  For a petabyte of input on 10k nodes (my current horizon), 1GB blocks give 1M tasks, or 100 per node.  Each task will then average around 1% of a node's execution time, so even tasks restarted near the end of the job won't add much to the overall time.
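
Spelled out with the same assumptions (1PB input, 1GB blocks, 10k nodes):

  long tasks = (1L << 50) / (1L << 30);  // 1PB / 1GB = 1,048,576 map tasks
  long perNode = tasks / 10000;          // ~104 tasks per node
  // A restarted task therefore re-does only ~1% of one node's work.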



[jira] Resolved: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-38?page=all ]
     
Doug Cutting resolved HADOOP-38:
--------------------------------

    Resolution: Fixed

I just committed this.



[jira] Updated: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-38?page=all ]

Doug Cutting updated HADOOP-38:
-------------------------------

    Fix Version: 0.1.0



[jira] Commented: (HADOOP-38) default splitter should incorporate fs block size

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366385 ] 

Doug Cutting commented on HADOOP-38:
------------------------------------

The surest way to get larger chunks is to increase the block size.

The default DFS block size is currently 32MB, which gives about 31k tasks for a terabyte input; that seems reasonable.  I think we should design things to handle perhaps a million tasks, which, at the current block size, would get us to 32-terabyte inputs.

Perhaps the default should be 1GB/block.  With a million tasks, that would get us to a maximum of a petabyte per job.  On a 10k-node cluster, a petabyte takes hours to read (100GB/node @ 10MB/second = 10k seconds).
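
Checking those figures (binary units, so they differ slightly from the
round numbers above):

  long MB = 1L << 20, GB = 1L << 30, TB = 1L << 40, PB = 1L << 50;
  long tasksPerTB = TB / (32 * MB);            // 32,768: the ~31k above
  long tasksPerPB = PB / GB;                   // 1,048,576: the ~1M above
  double readSecs = 100.0 * GB / (10 * MB);    // 10,240 s, roughly 3 hours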

We'll also need to revise the web UI to better handle a million tasks...

