Posted to mapreduce-dev@hadoop.apache.org by Jeff Zhang <zj...@gmail.com> on 2010/06/17 04:53:31 UTC

What is the reason for putting the output of one mapper task into one file ?

Hi all,

I checked the source code of the map task, and it seems that the output of
one map task is a single data file plus one index file, and each reducer
task fetches only its own part of that map output.
I am wondering why the map output is not written into n files (where n is
the number of reducers), since the map task already knows the Partitioner
and the logic would be much simpler. Is there any performance reason for
putting the output into one file? Thanks.
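
To make the question concrete, here is a rough sketch (the class and
method names below are mine, made up for illustration, not the actual
Hadoop code) of the layout I mean: a single data file holding every
partition back to back, plus an index file of (offset, length) records,
one per reducer:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustration only: one map-output data file with all partitions
// concatenated, plus an in-memory view of the index that records where
// each reducer's slice starts and how long it is.
public class MapOutputSketch {

    // One index record per reducer partition.
    static class IndexRecord {
        final long offset;  // byte offset of this partition in the data file
        final long length;  // number of bytes belonging to this partition
        IndexRecord(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // A reducer asking for partition p looks up the p-th index record and
    // reads exactly [offset, offset + length) from the single data file.
    static byte[] readPartition(File dataFile, IndexRecord[] index, int p)
            throws IOException {
        IndexRecord rec = index[p];
        byte[] buf = new byte[(int) rec.length];
        try (RandomAccessFile raf = new RandomAccessFile(dataFile, "r")) {
            raf.seek(rec.offset);
            raf.readFully(buf);
        }
        return buf;
    }
}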


-- 
Best Regards

Jeff Zhang

Re: What is the reason for putting the output of one mapper task into one file ?

Posted by Jeff Zhang <zj...@gmail.com>.
Arun, thanks for your reply.



On Thu, Jun 17, 2010 at 7:57 AM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> Not performance, but stability.
>
> We used to put the output of maps in r files (where r is the number of
> reduces) and quickly found out that the local disk would run out of inodes
> after running a few mid-to-large sized jobs (in terms of m * r).
>
> https://issues.apache.org/jira/browse/HADOOP-331
>
> Arun
>
> On Jun 16, 2010, at 7:53 PM, Jeff Zhang wrote:
>
>> Hi all,
>>
>> I checked the source code of the map task, and it seems that the output of
>> one map task is a single data file plus one index file, and each reducer
>> task fetches only its own part of that map output.
>> I am wondering why the map output is not written into n files (where n is
>> the number of reducers), since the map task already knows the Partitioner
>> and the logic would be much simpler. Is there any performance reason for
>> putting the output into one file? Thanks.
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
>



-- 
Best Regards

Jeff Zhang

Re: What is the reason for putting the output of one mapper task into one file ?

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Not performance, but stability.

We used to put the output of maps in r files (where r is the number of
reduces) and quickly found out that the local disk would run out of
inodes after running a few mid-to-large sized jobs (in terms of m * r).
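
To put rough numbers on it (purely illustrative, not measured): a job
with m = 10,000 maps and r = 1,000 reduces would leave m * r = 10,000,000
tiny files on the local disks under the per-partition scheme, versus
about 2 * m = 20,000 files (one data file plus one index file per map)
with the current single-file layout.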

https://issues.apache.org/jira/browse/HADOOP-331

Arun

On Jun 16, 2010, at 7:53 PM, Jeff Zhang wrote:

> Hi all,
>
> I checked the source code of the map task, and it seems that the output of
> one map task is a single data file plus one index file, and each reducer
> task fetches only its own part of that map output.
> I am wondering why the map output is not written into n files (where n is
> the number of reducers), since the map task already knows the Partitioner
> and the logic would be much simpler. Is there any performance reason for
> putting the output into one file? Thanks.
>
>
> -- 
> Best Regards
>
> Jeff Zhang