Posted to general@hadoop.apache.org by Karan Jindal <ka...@students.iiit.ac.in> on 2010/06/16 14:10:05 UTC

How many records will be passed to a map function??

Hi all,

Given a scenario in which an input file contains 1000 records (one record
per line) with a total size of 12k, and I set the number of map tasks to 2.
How many records will be passed to each map task? Will the distribution be
equal?

InputFormat = TextInputFormat
Block size  = default HDFS block size

Hoping for a reply..

Regards
Karan



Re: How many records will be passed to a map function??

Posted by Aaron Kimball <aa...@cloudera.com>.
Short answer: FileInputFormat & friends generate splits based on byte
ranges.

Assuming your records are all equally sized, you'll get half of your
records in each mapper. If your record sizes vary widely, then your mileage
may vary.
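
In other words, something like this quick toy model (not Hadoop code; the
class and variable names are made up for illustration). It credits each
record to the split in which it starts, which is how TextInputFormat's
LineRecordReader handles lines that cross a split boundary:

  // Toy model: count how many records land in each byte-range split,
  // crediting a record to the split where it starts.
  public class SplitDistribution {
    public static void main(String[] args) {
      long totalSize  = 12000;                   // ~12k of input
      int  numSplits  = 2;
      long splitSize  = totalSize / numSplits;   // 6000 bytes per split
      int  records    = 1000;
      long recordSize = totalSize / records;     // 12 bytes per record (uniform)

      int[] perSplit = new int[numSplits];
      long offset = 0;
      for (int r = 0; r < records; r++) {
        int split = (int) Math.min(offset / splitSize, numSplits - 1);
        perSplit[split]++;                       // counted where the record starts
        offset += recordSize;
      }
      for (int s = 0; s < numSplits; s++) {
        System.out.println("split " + s + ": " + perSplit[s] + " records");
      }
    }
  }

With uniform 12-byte records this prints 500 records per split; make the
record sizes uneven and the counts drift apart.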

- Aaron


Re: How many records will be passed to a map function??

Posted by Eric Sammer <es...@cloudera.com>.
Karan:

In general, you should let Hadoop pick the number of mappers to use.
In the case of only 1000 records @ 12k, performance will be better
with a single mapper for I/O-bound jobs. When you force the number of
map tasks, Hadoop will do the following:

(Assuming FileInputFormat#getSplits(conf, numSplits) gets called)

totalSize    = total size of all input files, in bytes
goalSize     = totalSize / numSplits
minSplitSize = conf value mapred.min.split.size (default 1)

For each input file:
  length = file.size()
  while isSplitable(file) and length != 0:
    fileBlockSize  = the block size of the file
    minOfGoalBlock = min(goalSize, fileBlockSize)
    realSplitSize  = max(minSplitSize, minOfGoalBlock)

    emit a split covering the next realSplitSize bytes
    length = length - realSplitSize (give or take)

Note that it's actually more confusing than this, but this is the
general idea. Let's plug in some numbers:

1 file
totalSize = 12k file size
blockSize = 64MB block
numSplits = 2
goalSize = 6k (12k / 2)
minSplitSize = 1 (for FileInputFormat)

minOfGoalBlock = 6k (6k < 64MB)
realSplitSize = 6k (6k > 1)

We end up with 2 splits, 6k each. RecordReaders then parse this into records.
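
If it helps to see that arithmetic as runnable code, here is a standalone
sketch of the same computation (not the actual Hadoop source; the class
name and the final ceiling division are mine):

  // Sketch of the split-size math above, plugging in the same numbers:
  // a 12k file, a 64MB block, and 2 requested map tasks.
  public class SplitMath {
    // realSplitSize = max(minSplitSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
      return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
      long totalSize = 12 * 1024;              // 12k of input
      int  numSplits = 2;                      // requested map tasks
      long goalSize  = totalSize / numSplits;  // 6k
      long minSize   = 1;                      // mapred.min.split.size default
      long blockSize = 64L * 1024 * 1024;      // 64MB HDFS block

      long splitSize = computeSplitSize(goalSize, minSize, blockSize);
      long splits    = (totalSize + splitSize - 1) / splitSize;   // ceiling
      System.out.println("split size: " + splitSize + " bytes");  // 6144
      System.out.println("splits:     " + splits);                // 2
    }
  }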

Note that this applies to the old APIs. The newer APIs work slightly
differently, but I think the result is equivalent.
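
For comparison, my understanding of the difference is that the new-API
getSplits() no longer takes numSplits; instead of clamping against goalSize
it clamps against a configurable maximum split size (mapred.max.split.size).
Roughly (again, a sketch rather than the actual source):

  public class OldVsNewSplitSize {
    public static void main(String[] args) {
      long minSize   = 1;
      long blockSize = 64L * 1024 * 1024;
      long goalSize  = 6 * 1024;           // old API: totalSize / numSplits
      long maxSize   = Long.MAX_VALUE;     // new API: max split size default

      System.out.println("old: " + Math.max(minSize, Math.min(goalSize, blockSize)));
      System.out.println("new: " + Math.max(minSize, Math.min(maxSize, blockSize)));
    }
  }

With the defaults, the new API would keep this whole 12k file in one split,
so forcing exactly two mappers this way really only applies to the old
numSplits path, if I'm reading it right.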

(If anyone wants to double-check my summary, I welcome it. This is
some hairy code, and these questions come up frequently.)

Hope this helps.


-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com