Posted to mapreduce-user@hadoop.apache.org by Sahana Bhat <sa...@gmail.com> on 2011/09/07 11:07:09 UTC

Multiple Mappers and One Reducer

Hi,

Is it possible to have multiple mappers, each operating on a different
input file, whose results (key-value pairs from the different mappers) are
processed by a single reducer?

Regards,
Sahana

Re: Multiple Mappers and One Reducer

Posted by Harsh J <ha...@cloudera.com>.

Praveenesh,

The JIRA https://issues.apache.org/jira/browse/MAPREDUCE-369
introduced MultipleInputs for the new API and carries a patch that I
think would apply without much trouble to your cluster's sources. You
can mail me directly if you need help applying the patch.

Alternatively, you can download 0.21, where it is found, pull out the
particular source files, and add them to your project's source tree
with their license and package names intact (which I think is a legal
requirement? Others can correct me if I'm wrong). You can then use it
as a regular import.

HTH.

On Wed, Sep 7, 2011 at 3:34 PM, praveenesh kumar <pr...@gmail.com> wrote:
> Harsh, can you please explain how we can use MultipleInputs with the Job
> object on Hadoop 0.20.2? As you can see, MultipleInputs there uses the
> JobConf object. I want to use the Job object, as in the new Hadoop 0.21 API.
> I remember you talked about pulling things out of the new API and adding
> them to our project.
> Can you please shed more light on how we can do this?
>
> Thanks,
> Praveenesh.



-- 
Harsh J

Re: Multiple Mappers and One Reducer

Posted by praveenesh kumar <pr...@gmail.com>.

Harsh, can you please explain how we can use MultipleInputs with the Job
object on Hadoop 0.20.2? As you can see, MultipleInputs there uses the
JobConf object. I want to use the Job object, as in the new Hadoop 0.21 API.
I remember you talked about pulling things out of the new API and adding
them to our project.
Can you please shed more light on how we can do this?

Thanks,
Praveenesh.

On Wed, Sep 7, 2011 at 2:57 AM, Harsh J <ha...@cloudera.com> wrote:

> Sahana,
>
> Yes, this is possible as well. Please take a look at the MultipleInputs
> API @
> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>
> It lets you add multiple input paths, each with its own mapper
> implementation, and you can then have a common reducer, since the key
> is what you'll be matching against.

Re: Multiple Mappers and One Reducer

Posted by Harsh J <ha...@cloudera.com>.

Sahana,

Yes, this is possible as well. Please take a look at the MultipleInputs
API @ http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleInputs.html

It lets you add multiple input paths, each with its own mapper
implementation, and you can then have a common reducer, since the key
is what you'll be matching against.
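
To illustrate the idea only, here is a minimal plain-Python sketch (not
Hadoop code, and not the MultipleInputs API itself). It assumes each input
is paired with its own mapper function, that every mapper emits the same
key shape, and that values are tagged with their source so one reducer can
tell them apart. The sample rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; the column layouts follow the two files in the thread.
FILE1_ROWS = ["2011-09-07,a,b,40,7", "2011-09-08,a,b,10,1"]  # Date,counter1,counter2,CV1,CV2
FILE2_ROWS = ["2011-09-07,a,b,1,2,25"]                       # Date,counter1,counter2,CV3,CV4,CV5


def file1_mapper(line):
    """Mapper for File1: emit the shared key with CV1, tagged by name."""
    date, c1, c2, cv1, _cv2 = line.split(",")
    yield (date, c1, c2), ("CV1", float(cv1))


def file2_mapper(line):
    """Mapper for File2: emit the shared key with CV5, tagged by name."""
    date, c1, c2, _cv3, _cv4, cv5 = line.split(",")
    yield (date, c1, c2), ("CV5", float(cv5))


def common_reducer(key, tagged_values):
    """Single reducer: combine the values from both mappers for one key."""
    values = dict(tagged_values)
    # Only keys present in both inputs can produce CV6.
    if "CV1" in values and "CV5" in values:
        yield key + (values["CV1"] * values["CV5"] / 100,)


def run_job(inputs, reducer):
    """Run each (rows, mapper) pair, group the output by key, then reduce."""
    shuffled = defaultdict(list)
    for rows, mapper in inputs:
        for line in rows:
            for key, value in mapper(line):
                shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


result = run_job(
    [(FILE1_ROWS, file1_mapper), (FILE2_ROWS, file2_mapper)],
    common_reducer,
)
print(result)  # [('2011-09-07', 'a', 'b', 10.0)]
```

The 2011-09-08 row appears in only one input, so this sketch drops it;
whether unmatched keys should be dropped or emitted with a default is a
design choice the thread does not settle.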

On Wed, Sep 7, 2011 at 3:02 PM, Sahana Bhat <sa...@gmail.com> wrote:
> Hi,
> I understand that a given file is normally split across 'n' mapper
> instances.
> The scenario I have is:
> 1. Two files that do not have the same number of columns (but have
> similar data in a few columns) need to be processed, and after the
> computation a single output file has to be generated.
> Note: CV = computed value
> File1, belonging to one dataset, has data for:
> Date,counter1,counter2,CV1,CV2
> File2, belonging to another dataset, has data for:
> Date,counter1,counter2,CV3,CV4,CV5
> The computation to be carried out on these two files is:
> CV6 = (CV1*CV5)/100
> The final emitted output file should have its columns in this order:
> Date,counter1,counter2,CV6
> The idea is to have two mappers (not instances), one running on each file,
> and a single reducer that emits the final result file.
> Thanks,
> Sahana



-- 
Harsh J

Re: Multiple Mappers and One Reducer

Posted by Sudharsan Sampath <su...@gmail.com>.

Hi,

It's possible by setting the number of reduce tasks to 1. Based on your
example, it looks like you need to group your records on "Date, counter1
and counter2", so those columns should go into the logic that builds the
key for your map output.
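
A rough sketch of that key logic in plain Python (not Hadoop code): the
three grouping columns are joined into one key, and a partition function
picks the reduce task from a hash of the key modulo the number of reduce
tasks. The function names are hypothetical, and zlib.crc32 is only a
stand-in for whatever hash the framework actually uses.

```python
import zlib


def build_key(date, counter1, counter2):
    """Join the three grouping columns into one map-output key."""
    return ",".join((date, counter1, counter2))


def partition(key, num_reduce_tasks):
    """Pick the reduce task for a key: a stable hash modulo the task count."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# Hypothetical keys built from sample column values.
keys = [
    build_key("2011-09-07", "a", "b"),
    build_key("2011-09-07", "a", "c"),
    build_key("2011-09-08", "a", "b"),
]

# With one reduce task, every key lands in partition 0, so a single
# reducer sees all groups and writes a single output file.
partitions = {partition(k, 1) for k in keys}
print(partitions)  # {0}
```

With more than one reduce task the same keys would spread across
partitions, and each reducer would write its own output file.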

Thanks
Sudhan S

On Wed, Sep 7, 2011 at 3:02 PM, Sahana Bhat <sa...@gmail.com> wrote:

> Hi,
>
> I understand that a given file is normally split across 'n' mapper
> instances.
>
> The scenario I have is:
> 1. Two files that do not have the same number of columns (but have
> similar data in a few columns) need to be processed, and after the
> computation a single output file has to be generated.
>
> Note: CV = computed value
>
> File1, belonging to one dataset, has data for:
> Date,counter1,counter2,CV1,CV2
>
> File2, belonging to another dataset, has data for:
> Date,counter1,counter2,CV3,CV4,CV5
>
> The computation to be carried out on these two files is:
> CV6 = (CV1*CV5)/100
>
> The final emitted output file should have its columns in this order:
> Date,counter1,counter2,CV6
>
> The idea is to have two mappers (not instances), one running on each file,
> and a single reducer that emits the final result file.
>
> Thanks,
> Sahana

Re: Multiple Mappers and One Reducer

Posted by Sahana Bhat <sa...@gmail.com>.

Hi,

I understand that a given file is normally split across 'n' mapper
instances.

The scenario I have is:
1. Two files that do not have the same number of columns (but have
similar data in a few columns) need to be processed, and after the
computation a single output file has to be generated.

Note: CV = computed value

File1, belonging to one dataset, has data for:
Date,counter1,counter2,CV1,CV2

File2, belonging to another dataset, has data for:
Date,counter1,counter2,CV3,CV4,CV5

The computation to be carried out on these two files is:
CV6 = (CV1*CV5)/100

The final emitted output file should have its columns in this order:
Date,counter1,counter2,CV6

The idea is to have two mappers (not instances), one running on each file,
and a single reducer that emits the final result file.
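
To make the expected output concrete, here is a small Python sketch of the
computation for one matching pair of rows. The sample values and the
compute_cv6 helper are made up for illustration and are not taken from the
actual datasets.

```python
def compute_cv6(cv1, cv5):
    """CV6 = (CV1 * CV5) / 100, as defined above."""
    return (cv1 * cv5) / 100


# Hypothetical sample rows, one from each file.
file1_row = "2011-09-07,12,34,30,9"      # Date,counter1,counter2,CV1,CV2
file2_row = "2011-09-07,12,34,3,4,15"    # Date,counter1,counter2,CV3,CV4,CV5

f1 = file1_row.split(",")
f2 = file2_row.split(",")

# The first three columns are shared and must agree for the rows to combine.
assert f1[:3] == f2[:3]

cv6 = compute_cv6(float(f1[3]), float(f2[5]))

# Output columns in the required order: Date,counter1,counter2,CV6
output_row = ",".join(f1[:3] + [f"{cv6:g}"])
print(output_row)  # 2011-09-07,12,34,4.5
```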

Thanks,
Sahana

On Wed, Sep 7, 2011 at 2:40 PM, Harsh J <ha...@cloudera.com> wrote:

> Sahana,
>
> Yes. But, isn't that how it is normally? What makes you question this
> capability?

Re: Multiple Mappers and One Reducer

Posted by Harsh J <ha...@cloudera.com>.

Sahana,

Yes. But isn't that how it normally works? What makes you question this
capability?

On Wed, Sep 7, 2011 at 2:37 PM, Sahana Bhat <sa...@gmail.com> wrote:
> Hi,
> Is it possible to have multiple mappers, each operating on a different
> input file, whose results (key-value pairs from the different mappers) are
> processed by a single reducer?
> Regards,
> Sahana



-- 
Harsh J