Posted to mapreduce-dev@hadoop.apache.org by Jeff Zhang <zj...@gmail.com> on 2010/06/17 04:53:31 UTC
What is the reason for putting the output of one mapper task into one file?
Hi all,
I checked the source code of the Mapper task; it seems that the output of
one mapper task is a single data file plus an index file, and each reducer
task fetches its part of that output.
I am wondering why the map output is not written to n files (where n is the
number of reducers), since the mapper task already knows the Partitioner, and
the logic would be much simpler. Is there a performance reason for putting
the output into one file? Thanks.
--
Best Regards
Jeff Zhang
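The single-file-plus-index layout Jeff describes can be sketched roughly as
follows. This is an illustrative toy in Python, not Hadoop's actual on-disk
format: the record encoding, the helper names `write_map_output` and
`read_partition`, and the fixed 16-byte (offset, length) index entries are all
assumptions chosen for clarity.

```python
# Toy sketch (NOT Hadoop's real code): a map task writes all partitions into
# ONE data blob plus ONE index, and a reducer seeks straight to its segment.
import io
import struct

def write_map_output(records, partition, num_reducers):
    """records: iterable of (key, value) strings; partition: fn(key, n) -> int."""
    # Bucket records by target reducer, mirroring what the Partitioner decides.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))

    data = io.BytesIO()
    index = []  # one (start_offset, length) entry per reducer partition
    for bucket in buckets:
        start = data.tell()
        for key, value in bucket:
            data.write(f"{key}\t{value}\n".encode())
        index.append((start, data.tell() - start))
    # Pack each index entry as two big-endian 8-byte ints (an assumed format).
    index_bytes = b"".join(struct.pack(">QQ", off, ln) for off, ln in index)
    return data.getvalue(), index_bytes

def read_partition(data, index_bytes, reducer_id):
    # A reducer consults only its own 16-byte index entry, then slices its segment.
    off, ln = struct.unpack_from(">QQ", index_bytes, reducer_id * 16)
    return data[off:off + ln]
```

With this layout the map side still creates exactly two local files no matter
how many reducers there are; the index is what lets each reducer fetch only
its own byte range.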
Re: What is the reason for putting the output of one mapper task into one file?
Posted by Jeff Zhang <zj...@gmail.com>.
Arun, thanks for your reply.
On Thu, Jun 17, 2010 at 7:57 AM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> Not performance, but stability.
>
> We used to put the output of maps in r files (where r is the number of reduces)
> and quickly found out that the local disk would run out of inodes after
> running a few mid-to-large sized jobs (in terms of m * r).
>
> https://issues.apache.org/jira/browse/HADOOP-331
>
> Arun
>
> On Jun 16, 2010, at 7:53 PM, Jeff Zhang wrote:
>
>> Hi all,
>>
>> I checked the source code of the Mapper task; it seems that the output of
>> one mapper task is a single data file plus an index file, and each reducer
>> task fetches its part of that output.
>> I am wondering why the map output is not written to n files (where n is the
>> number of reducers), since the mapper task already knows the Partitioner, and
>> the logic would be much simpler. Is there a performance reason for putting
>> the output into one file? Thanks.
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
>
--
Best Regards
Jeff Zhang
Re: What is the reason for putting the output of one mapper task into one file?
Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Not performance, but stability.
We used to put the output of maps in r files (where r is the number of
reduces) and quickly found out that the local disk would run out of
inodes after running a few mid-to-large sized jobs (in terms of m * r).
https://issues.apache.org/jira/browse/HADOOP-331
Arun
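The scaling problem Arun points to is simple arithmetic: with one file per
reduce, a job creates m * r local files, while the data-plus-index scheme
creates only 2 per map. The job sizes below are illustrative assumptions, not
figures from the thread.

```python
# Back-of-the-envelope file counts for the two map-output schemes.
def local_files_per_job(num_maps, num_reduces):
    """Old scheme: each map writes one file per reduce -> m * r files."""
    return num_maps * num_reduces

def local_files_single_spill(num_maps):
    """Current scheme: each map writes one data file + one index file."""
    return num_maps * 2

m, r = 2000, 500  # a hypothetical mid-to-large sized job
print(local_files_per_job(m, r))     # 1000000 local files (inode pressure)
print(local_files_single_spill(m))   # 4000 local files
```

A few such jobs sharing the same local disks can exhaust the filesystem's
inode supply under the first scheme, which is the failure mode HADOOP-331
addressed.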
On Jun 16, 2010, at 7:53 PM, Jeff Zhang wrote:
> Hi all,
>
> I checked the source code of the Mapper task; it seems that the output of
> one mapper task is a single data file plus an index file, and each reducer
> task fetches its part of that output.
> I am wondering why the map output is not written to n files (where n is the
> number of reducers), since the mapper task already knows the Partitioner, and
> the logic would be much simpler. Is there a performance reason for putting
> the output into one file? Thanks.
>
>
> --
> Best Regards
>
> Jeff Zhang