Posted to mapreduce-dev@hadoop.apache.org by Pavan Kulkarni <pa...@gmail.com> on 2012/08/19 05:35:38 UTC

Significance of file.out.index during Shuffle Phase ?

Hi,

  I am trying to understand exactly how each reducer finds and fetches
the data for its own partition from the Map nodes.
During the execution of a MapReduce job, I see that *file.out* is created on
the Map nodes, so my question is: how does a reducer
know which part of file.out to fetch? Does *file.out.index* play any role?
Any help is appreciated. Thanks.



--With Regards
Pavan Kulkarni

Re: Significance of file.out.index during Shuffle Phase ?

Posted by Pavan Kulkarni <pa...@gmail.com>.

Arun,

  Yes, got it now. What I am trying to do is store the intermediate
data on a shared file system and create hard links to the
map outputs (file.out) spilled by the Map nodes. This would eliminate the
copy phase of the shuffle stage.
But now that I have learned that the data for different reducers is
partitioned within the same file (file.out), creating hard links wouldn't
serve the purpose, would it? Or is there a way to do it?
Please correct me if I am wrong in any assumption. Thanks.
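
The limitation described above can be seen in a small Python sketch. It
assumes a POSIX file system that supports hard links; the link name, the
helper function, and the sample payloads are made up for illustration and
are not part of Hadoop. A hard link is only a second name for the whole
file, so a reader going through it still needs the partition's offset and
length.

```python
import os
import tempfile


def link_and_read(payloads, reduce_id):
    """Write all partitions into one file, hard-link it, and read one
    partition back through the link using its offset and length."""
    offsets = []
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "file.out")
        dst = os.path.join(tmp, "reducer-view")  # hypothetical link name
        with open(src, "wb") as f:
            for payload in payloads:
                # Remember where each partition starts and how long it is.
                offsets.append((f.tell(), len(payload)))
                f.write(payload)
        os.link(src, dst)  # second name for the same inode, not a slice
        same_size = os.path.getsize(dst) == os.path.getsize(src)
        start, length = offsets[reduce_id]
        with open(dst, "rb") as f:
            f.seek(start)
            return same_size, f.read(length)


if __name__ == "__main__":
    print(link_and_read([b"aa", b"bbbb", b"c"], 1))  # (True, b'bbbb')
```

The link exposes every partition, so the per-partition offsets still have
to come from somewhere else.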

On Sun, Aug 19, 2012 at 10:54 PM, Arun C Murthy <ac...@hortonworks.com> wrote:

> You'll need to make significant changes to MapTask.java, which won't make it
> back to the mainline.
>
> Why? We had this before and quickly ran out of inodes on the local disk.
> Think of large jobs with 10,000 maps * 1,000 reduces -> that's 10M files.
>
> Arun


-- 

--With Regards
Pavan Kulkarni

Re: Significance of file.out.index during Shuffle Phase ?

Posted by Arun C Murthy <ac...@hortonworks.com>.

You'll need to make significant changes to MapTask.java, which won't make it back to the mainline.

Why? We had this before and quickly ran out of inodes on the local disk. Think of large jobs with 10,000 maps * 1,000 reduces -> that's 10M files.
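
A rough sketch of that arithmetic in Python follows. The function name is
hypothetical, and the figure of two files per map for the single-file layout
(file.out plus file.out.index) is an assumption based on the file names
mentioned in this thread, not a statement about the Hadoop source.

```python
def intermediate_file_count(num_maps, num_reduces, per_reducer_files):
    """Count intermediate map-output files across the whole job.

    per_reducer_files=True  -> one file per (map, reduce) pair, as in the
                               file.out.1, file.out.2, ... proposal.
    per_reducer_files=False -> assumed layout of one data file plus one
                               index file per map.
    """
    if per_reducer_files:
        return num_maps * num_reduces
    return num_maps * 2


if __name__ == "__main__":
    print(intermediate_file_count(10_000, 1_000, True))   # 10000000
    print(intermediate_file_count(10_000, 1_000, False))  # 20000
```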

Arun

On Aug 19, 2012, at 8:57 AM, Pavan Kulkarni wrote:

> Oh, thanks a lot, Harsh. That is exactly what I was looking for.
> I wanted to create a separate file.out for each reducer, something
> like
> file.out.1 for reducer 1, file.out.2 for reducer 2, etc. Is it possible to do
> this in the MapReduce program, or do I need to tweak some Hadoop source files
> for that? Thanks.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

Re: Re: Significance of file.out.index during Shuffle Phase ?

Posted by 俞盛朋 <th...@gmail.com>.

Oh, sorry, I misunderstood your question. Please disregard what I said.

-----Original Message-----
From: Pavan Kulkarni [mailto:pavan.baburao@gmail.com]
Sent: August 20, 2012, 9:48
To: mapreduce-dev@hadoop.apache.org
Subject: Re: Re: Significance of file.out.index during Shuffle Phase ?

Hi,

  But I don't see those files during execution. I only see file.out in
the job_ID/attemptID/output/ folder.


-- 

--With Regards
Pavan Kulkarni

Re: Re: Significance of file.out.index during Shuffle Phase ?

Posted by Pavan Kulkarni <pa...@gmail.com>.

Hi,

  But I don't see those files during execution. I only see file.out in
the job_ID/attemptID/output/ folder.

On Sun, Aug 19, 2012 at 8:44 PM, 俞盛朋 <th...@gmail.com> wrote:

> The MapReduce program creates an output file for each reducer, named
> "part-xxxxxx" by default.


-- 

--With Regards
Pavan Kulkarni

Re: Significance of file.out.index during Shuffle Phase ?

Posted by 俞盛朋 <th...@gmail.com>.

The MapReduce program creates an output file for each reducer, named
"part-xxxxxx" by default.

-----Original Message-----
From: Pavan Kulkarni [mailto:pavan.baburao@gmail.com]
Sent: August 19, 2012, 23:58
To: mapreduce-dev@hadoop.apache.org
Subject: Re: Significance of file.out.index during Shuffle Phase ?

Oh, thanks a lot, Harsh. That is exactly what I was looking for.
I wanted to create a separate file.out for each reducer, something
like
file.out.1 for reducer 1, file.out.2 for reducer 2, etc. Is it possible to do
this in the MapReduce program, or do I need to tweak some Hadoop source files
for that? Thanks.




-- 

--With Regards
Pavan Kulkarni

Re: Significance of file.out.index during Shuffle Phase ?

Posted by Pavan Kulkarni <pa...@gmail.com>.

Oh, thanks a lot, Harsh. That is exactly what I was looking for.
I wanted to create a separate file.out for each reducer, something
like
file.out.1 for reducer 1, file.out.2 for reducer 2, etc. Is it possible to do
this in the MapReduce program, or do I need to tweak some Hadoop source files
for that? Thanks.

On Sun, Aug 19, 2012 at 7:02 AM, Harsh J <ha...@cloudera.com> wrote:

> Hey Pavan,
>
> Yes, you've got it almost right on how file.out is served to each
> reducer. See the code at
>
> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java?view=markup
> (the method at L502:L565, which sends data for a specific
> reduce/partition ID (integer)).



-- 

--With Regards
Pavan Kulkarni

Re: Significance of file.out.index during Shuffle Phase ?

Posted by Harsh J <ha...@cloudera.com>.

Hey Pavan,

Yes, you've got it almost right on how file.out is served to each
reducer. See the code at
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java?view=markup
(the method at L502:L565, which sends data for a specific
reduce/partition ID (integer)).
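
To make the role of an index file concrete, here is a minimal Python sketch
of an offset lookup keyed by reduce/partition ID. It assumes the index holds
one fixed-size record per partition made of three big-endian 64-bit values
(start offset, raw length, on-disk length). That layout, the omission of any
trailing checksum, and all function names are assumptions for illustration,
not a description of ShuffleHandler.java.

```python
import struct

# Assumed record layout: start offset, raw length, on-disk part length,
# each a big-endian signed 64-bit integer (24 bytes per partition).
RECORD = struct.Struct(">qqq")


def build_map_output(partitions):
    """Concatenate per-partition byte strings into one data blob and
    build a matching index blob with one record per partition."""
    data = bytearray()
    index = bytearray()
    for payload in partitions:
        # The record is written before the payload is appended, so
        # len(data) is the payload's start offset.
        index += RECORD.pack(len(data), len(payload), len(payload))
        data += payload
    return bytes(data), bytes(index)


def read_partition(data, index, reduce_id):
    """Return only the bytes that belong to the given reduce/partition ID."""
    start, _raw_len, part_len = RECORD.unpack_from(index, reduce_id * RECORD.size)
    return data[start:start + part_len]


if __name__ == "__main__":
    data, index = build_map_output([b"apple\n", b"banana\ncherry\n", b""])
    print(read_partition(data, index, 1))  # b'banana\ncherry\n'
```

Because every record has the same size, the lookup is a single seek into the
index followed by a single ranged read of the data file.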

On Sun, Aug 19, 2012 at 9:05 AM, Pavan Kulkarni <pa...@gmail.com> wrote:
> Hi,
>
>   I am trying to understand exactly how each reducer finds and fetches
> the data for its own partition from the Map nodes.
> During the execution of a MapReduce job, I see that *file.out* is created on
> the Map nodes, so my question is: how does a reducer
> know which part of file.out to fetch? Does *file.out.index* play any role?
> Any help is appreciated. Thanks.
>
>
>
> --With Regards
> Pavan Kulkarni

-- 
Harsh J