Posted to common-user@hadoop.apache.org by Dina Said <di...@gmail.com> on 2008/04/19 22:55:10 UTC

Combine previous Map Results

Dear all

Suppose that I have files that contain intermediate key values and I want
to combine these intermediate key values using a new MapReduce task. I
want this MapReduce task to combine, during the reduce stage, the
intermediate key values it generates with the intermediate key values I
already have.

Any ideas?

Dina

Re: Combine previous Map Results

Posted by Dina Said <di...@gmail.com>.
Thanks Joydeep.
I am sorry for not recognizing that in the first place.

Joydeep Sen Sarma wrote:
> Ummm .. was in the initial reply:
>
>> you can write a mapper that can decide the map logic based on the input
>> file name (look for the jobconf variable map.input.file in Java - or the
>> environment variable map_input_file in hadoop streaming).


RE: Combine previous Map Results

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Ummm .. was in the initial reply:
 
> you can write a mapper that can decide the map logic based on the input
> file name (look for the jobconf variable map.input.file in Java - or the
> environment variable map_input_file in hadoop streaming).

-----Original Message-----
From: Dina Said [mailto:dinasaid@gmail.com] 
Sent: Friday, April 25, 2008 5:42 PM
To: core-user@hadoop.apache.org
Subject: Re: Combine previous Map Results

Thanks Ted

But how can I specify that the inputs coming from certain files should be
processed by f_a and the other inputs should be processed by f_b?
Or how can I check the input type?

The input to the map is in the form of InputSplits, as far as I know.

Dina



Re: Combine previous Map Results

Posted by Dina Said <di...@gmail.com>.
Thanks Ted

But how can I specify that the inputs coming from certain files should be
processed by f_a and the other inputs should be processed by f_b?
Or how can I check the input type?

The input to the map is in the form of InputSplits, as far as I know.

Dina

Ted Dunning wrote:
> You can only have one map function.
>
> But that function can decide which sort of thing to do based on which input
> it is given.  That allows input of type A to be processed with map function
> f_a and input of type B to be processed with map function f_b.


Re: Combine previous Map Results

Posted by Ted Dunning <td...@veoh.com>.

You can only have one map function.

But that function can decide which sort of thing to do based on which input
it is given.  That allows input of type A to be processed with map function
f_a and input of type B to be processed with map function f_b.
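A minimal sketch of that dispatch pattern in Python (the record tags, f_a,
and f_b below are invented purely for illustration; they are not Hadoop
APIs):

```python
# One map function, two behaviors: dispatch on the shape of each record.
# Records tagged "A:" go to f_a, records tagged "B:" go to f_b.

def f_a(value):
    # map logic for input of type A (placeholder)
    return [(value.lower(), 1)]

def f_b(value):
    # map logic for input of type B (placeholder)
    return [(value.upper(), 1)]

def unified_map(record):
    """The single map function the framework sees; it picks the logic."""
    tag, _, value = record.partition(":")
    if tag == "A":
        return f_a(value)
    elif tag == "B":
        return f_b(value)
    raise ValueError("unrecognized input record: %r" % record)

if __name__ == "__main__":
    print(unified_map("A:Hello"))  # [('hello', 1)]
    print(unified_map("B:Hello"))  # [('HELLO', 1)]
```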




On 4/25/08 4:43 PM, "Dina Said" <di...@gmail.com> wrote:

> Thanks Joydeep for your reply.
> 
> But is there a possibility to have two or more Map tasks and a single
> reduce task?
> I want the reduce task to work on all the intermediate keys produced
> from these Map tasks.
> 
> I am sorry, I am new to Map-Reduce, but from my first reading I can
> see that we can define only one Map task.
> 
> Thanks
> Dina
> 


Re: Combine previous Map Results

Posted by Dina Said <di...@gmail.com>.
Thanks Joydeep for your reply.

But is there a possibility to have two or more Map tasks and a single
reduce task?
I want the reduce task to work on all the intermediate keys produced
from these Map tasks.

I am sorry, I am new to Map-Reduce, but from my first reading I can see
that we can define only one Map task.

Thanks
Dina


Joydeep Sen Sarma wrote:
> if one weren't thinking about performance - then the second map-reduce
> task would have to process both the data sets (the intermediate data and
> the new data). For the existing intermediate data - you want to do an
> identity map and for the new data - whatever map logic you have. You can
> write a mapper that can decide the map logic based on the input file name
> (look for the jobconf variable map.input.file in Java - or the environment
> variable map_input_file in hadoop streaming).
>
> if one were thinking about performance - then one would argue that
> re-sorting the existing intermediate data (as would happen in the simple
> solution) is pointless (it's already sorted by the desired key). if this
> is a concern - the only thing that's available right now (afaik) is a
> feature described in HADOOP-2085. (you would have to map-reduce the new
> data set only and then join the old and new data using the map-side joins
> described in that jira - this would require a third map-reduce task).
>
> (one could argue that if there was an option to skip map-side sorting on
> a per-file level - that would be perfect. one would skip map-side sorts
> of the old data and only sort the new data - and the reducer would merge
> the two).


RE: Combine previous Map Results

Posted by Joydeep Sen Sarma <js...@facebook.com>.
if one weren't thinking about performance - then the second map-reduce
task would have to process both the data sets (the intermediate data and
the new data). For the existing intermediate data - you want to do an
identity map and for the new data - whatever map logic you have. You can
write a mapper that can decide the map logic based on the input file name
(look for the jobconf variable map.input.file in Java - or the environment
variable map_input_file in hadoop streaming).
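For hadoop streaming, such a mapper might look like the sketch below. The
"/intermediate/" path test and the word-count-style branch are placeholders
I invented for illustration; the environment variable follows the
map_input_file convention mentioned above:

```python
#!/usr/bin/env python
# Streaming mapper that picks its map logic from the file being read.
# Hadoop streaming exposes the current split's file path in the
# map_input_file environment variable.
import os
import sys

def run_mapper(input_file, lines):
    out = []
    if "/intermediate/" in input_file:
        # existing intermediate data: identity map, pass key\tvalue through
        for line in lines:
            out.append(line)
    else:
        # new data: whatever real map logic you have (placeholder shown)
        for line in lines:
            for word in line.split():
                out.append("%s\t1\n" % word)
    return out

if __name__ == "__main__":
    for line in run_mapper(os.environ.get("map_input_file", ""), sys.stdin):
        sys.stdout.write(line)
```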

if one were thinking about performance - then one would argue that
re-sorting the existing intermediate data (as would happen in the simple
solution) is pointless (it's already sorted by the desired key). if this
is a concern - the only thing that's available right now (afaik) is a
feature described in HADOOP-2085. (you would have to map-reduce the new
data set only and then join the old and new data using the map-side joins
described in that jira - this would require a third map-reduce task).


(one could argue that if there was an option to skip map-side sorting on
a per-file level - that would be perfect. one would skip map-side sorts
of the old data and only sort the new data - and the reducer would merge
the two).
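The merge imagined here - combining runs that are each already sorted by
key, without re-sorting either - is an ordinary k-way merge. A toy Python
sketch with made-up data:

```python
import heapq

# Two runs already sorted by key: old intermediate data and freshly
# sorted new data (toy keys and counts, invented for illustration).
old_run = [("apple", 2), ("cherry", 1), ("pear", 4)]
new_run = [("apple", 1), ("banana", 3), ("pear", 1)]

# heapq.merge streams the sorted runs into one sorted stream without
# ever re-sorting either input - the work a reducer-side merge would do.
merged = list(heapq.merge(old_run, new_run))

if __name__ == "__main__":
    print(merged)
```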

