Posted to common-user@hadoop.apache.org by abhishek sharma <ab...@usc.edu> on 2010/03/25 03:27:01 UTC

posted again: how are the splits for map tasks computed?

I realized that I made a mistake in my earlier post. So here is the correct one.

I have a job ("loadgen") with only one input file, say part-00000, of size
1368654 bytes.

So when I submit this job, I get the following output:

INFO mapred.FileInputFormat: Total input paths to process : 1

However, in the JobTracker log, I see the following entry:

 Split info for job:job_201003131110_0043 with 2 splits

and subsequently 2 map tasks are started to process these two splits.
The size of the input splits for these 2 map tasks is 6843283 bytes, so the
input is divided equally into two splits.

My question is: Why are two map tasks created instead of one and why
is the combined size of the two splits greater than the size of my
input?

I also noticed that if I run the same job with 2 inputs (say)
part-00000 and part-00001, then only 2 map tasks are created.

To my knowledge, the number of map tasks should be the same as the
number of input files.

Thanks,

RE: posted again: how are the splits for map tasks computed?

Posted by "Segel, Mike" <ms...@navteq.com>.
Ok, it's 4:00am local time... silly question...
What's the block size of your HDFS?
And the file sizes you gave are in bytes? So the full file is 12MB?
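
A quick way to check, in case you don't have the number handy (the path below is just a placeholder):

hadoop fsck /path/to/part-00000 -files -blocks

That should print the file's length and its block list, so the block size and block count are visible at a glance.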


-----Original Message-----
From: absharma@gmail.com [mailto:absharma@gmail.com] On Behalf Of abhishek sharma
Sent: Wednesday, March 24, 2010 9:27 PM
To: common-dev@hadoop.apache.org; common-user@hadoop.apache.org
Subject: posted again: how are the splits for map tasks computed?

I realized that I made a mistake in my earlier post. So here is the correct one.

I have a job ("loadgen") with only one input file, say part-00000, of size
1368654 bytes.

So when I submit this job, I get the following output:

INFO mapred.FileInputFormat: Total input paths to process : 1

However, in the JobTracker log, I see the following entry:

 Split info for job:job_201003131110_0043 with 2 splits

and subsequently 2 map tasks are started to process these two splits.
The size of the input splits for these 2 map tasks is 6843283 bytes, so the
input is divided equally into two splits.

My question is: Why are two map tasks created instead of one and why
is the combined size of the two splits greater than the size of my
input?

I also noticed that if I run the same job with 2 inputs (say)
part-00000 and part-00001, then only 2 map tasks are created.

To my knowledge, the number of map tasks should be the same as the
number of input files.

Thanks,



Re: posted again: how are the splits for map tasks computed?

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi,
this is an interesting question. My understanding so far is that the number of map tasks is basically determined by the number of splits. When Hadoop splits the file, it takes a hint from the user (the number of mappers you set). At the same time, Hadoop maintains three parameters: the HDFS block size and the max and min split sizes. If the split size calculated from your hint falls between the min and max split sizes and does not exceed one block, the number of splits (and hence the number of mappers) is exactly what you asked for with that hint. Otherwise, Hadoop calculates the split size as max(minSplitSize, min(maxSplitSize, blockSize)), according to Tom White's book. Consequently, the number of splits (mappers) is fileSize / splitSize.

In your case, the default split size is 64 MB. If you set the number of mappers to 1, that single split would exceed 64 MB, so Hadoop doesn't take your hint. If you set the number to greater than 3, I believe you will get the number you want.
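
For reference, a minimal sketch of how that calculation plays out under the old mapred API, where the mapred.map.tasks hint enters through a goalSize term. This mirrors my reading of FileInputFormat.getSplits() in the 0.20 line, paraphrased rather than copied, and the input size below is purely illustrative:

// Sketch of the split-size logic, assuming 0.20-era
// org.apache.hadoop.mapred.FileInputFormat behaviour:
// goalSize = totalSize / (the mapred.map.tasks hint).
public class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // default dfs.block.size (64 MB)
        long minSize   = 1;                  // default mapred.min.split.size
        long totalSize = 13L * 1024 * 1024;  // illustrative ~13 MB input
        int  hint      = 2;                  // default mapred.map.tasks
        long goalSize  = totalSize / hint;
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        // goalSize (6.5 MB) is below one block, so splitSize == goalSize
        // and the file is cut into exactly 2 splits.
        System.out.println("splitSize = " + splitSize);
    }
}

This would also explain why a default hint of 2 produces two equal splits even for a file far smaller than one block.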

-Gang


----- Original Message -----
From: abhishek sharma <ab...@usc.edu>
To: Ravi Phulari <rp...@yahoo-inc.com>
Cc: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: Thursday, March 25, 2010, 2:04:35 AM
Subject: Re: posted again: how are the splits for map tasks computed?

Ravi,

On Wed, Mar 24, 2010 at 9:32 PM, Ravi Phulari <rp...@yahoo-inc.com> wrote:
> Hello Abhishek ,
>
> Unless you have modified the conf/mapred-site.xml file, MapReduce will use the
> configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml.
> In that configuration file, mapred.map.tasks is set to 2, and that is why
> your job is running 2 map tasks.
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>2</value>
>   <description>The default number of map tasks per job.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>

I tried setting mapred.map.tasks to 1, but the JobTracker still started
2 map tasks.

However, setting the number of map tasks to 1 for "loadgen" worked.

Cheers,
Abhishek

Re: posted again: how are the splits for map tasks computed?

Posted by abhishek sharma <ab...@usc.edu>.
Ravi,

On Wed, Mar 24, 2010 at 9:32 PM, Ravi Phulari <rp...@yahoo-inc.com> wrote:
> Hello Abhishek ,
>
> Unless you have modified the conf/mapred-site.xml file, MapReduce will use the
> configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml.
> In that configuration file, mapred.map.tasks is set to 2, and that is why
> your job is running 2 map tasks.
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>2</value>
>   <description>The default number of map tasks per job.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>

I tried setting mapred.map.tasks to 1, but the JobTracker still started
2 map tasks.

However, setting the number of map tasks to 1 for "loadgen" worked.
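
(For the record, I passed the hint on the loadgen command line. I am quoting the flags from memory, so double-check them against the tool's usage message; the paths are placeholders:

hadoop jar hadoop-*-test.jar loadgen -m 1 -indir /path/to/in -outdir /path/to/out

)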

Cheers,
Abhishek

Re: posted again: how are the splits for map tasks computed?

Posted by Ravi Phulari <rp...@yahoo-inc.com>.
Hello Abhishek ,

Unless you have modified the conf/mapred-site.xml file, MapReduce will use the configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml.
In that configuration file, mapred.map.tasks is set to 2, and that is why your job is running 2 map tasks.

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
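
If you want a different number for one particular job, the hint can also be set programmatically. Here is a minimal sketch with the old JobConf API; the class name and paths are illustrative, and note the value is still only a hint to the InputFormat:

// Hypothetical driver showing conf.setNumMapTasks(1), the programmatic
// equivalent of mapred.map.tasks=1 for this job only.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OneMapJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OneMapJob.class);
        conf.setJobName("one-map-job");   // illustrative name
        conf.setNumMapTasks(1);           // hint only; the InputFormat decides
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);           // identity map/reduce by default
    }
}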

Hope this helps.
-
Ravi

On 3/24/10 7:27 PM, "abhishek sharma" <ab...@usc.edu> wrote:

I realized that I made a mistake in my earlier post. So here is the correct one.

I have a job ("loadgen") with only one input file, say part-00000, of size
1368654 bytes.

So when I submit this job, I get the following output:

INFO mapred.FileInputFormat: Total input paths to process : 1

However, in the JobTracker log, I see the following entry:

 Split info for job:job_201003131110_0043 with 2 splits

and subsequently 2 map tasks are started to process these two splits.
The size of the input splits for these 2 map tasks is 6843283 bytes, so the
input is divided equally into two splits.

My question is: Why are two map tasks created instead of one and why
is the combined size of the two splits greater than the size of my
input?

I also noticed that if I run the same job with 2 inputs (say)
part-00000 and part-00001, then only 2 map tasks are created.

To my knowledge, the number of map tasks should be the same as the
number of input files.

Thanks,


Ravi
--

