Posted to common-user@hadoop.apache.org by some speed <sp...@gmail.com> on 2008/11/10 06:10:00 UTC

reading input for a map function from 2 different files?

I was wondering if it was possible to read the input for a map function from
2 different files:

1st file ---> user-input file from a particular location(path)
2nd file ---> A resultant file (has just one <key,value> pair) from a
previous MapReduce job. (I am chaining MapReduce jobs.)

Now, for every <key,value> pair in the user-input file, I would like to use
the same <key,value> pair from the 2nd file for some calculations.

Is it possible for me to do so? Can someone guide me in the right direction
please?


Thanks!

Re: reading input for a map function from 2 different files?

Posted by some speed <sp...@gmail.com>.
But there is another problem: is it possible to set a config variable from
the reduce phase? Once I get the average value in the first MapReduce job, I
need to initialise the config variable to it. Can this be done in the
reduce/map methods?

As far as I know, we can SET a config variable only while configuring the
job, and we can only RETRIEVE these variables at the start of the map/reduce
methods. So, while setting up the second job, how do I get the value of the
first job's reduce output?

I apologize if this happens to be a trivial question, but I am pretty
confused at this point and unable to proceed.

Thanks.


Re: reading input for a map function from 2 different files?

Posted by some speed <sp...@gmail.com>.
Thank you all!
What Milind has said will do the trick for me, as I need accurate values
for the deviation.
Passing variables between jobs by means of the configure method and
getInt/setInt will make things a lot easier!


Re: reading input for a map function from 2 different files?

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
Unless you really care about getting exact averages etc., I would
suggest simply sampling the input and computing your statistics from
that.

It will be a lot faster, and you won't have to deal with under/overflow etc.

If your sample is reasonably large, then your results will be pretty
close to the true values.

Miles
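
The sampling idea can be illustrated in a few lines of plain Java (the class name, data set, and sample size below are my own, not from this thread): draw a fixed-size uniform sample with reservoir sampling, then estimate the mean from the sample alone.

```java
import java.util.Random;

public class SampleStats {
    // Reservoir sampling: keep a uniform random sample of size k from the data.
    // Each element ends up in the sample with probability k/n.
    public static double[] reservoirSample(double[] data, int k, long seed) {
        Random rnd = new Random(seed);
        double[] sample = new double[k];
        for (int i = 0; i < data.length; i++) {
            if (i < k) {
                sample[i] = data[i];          // fill the reservoir first
            } else {
                int j = rnd.nextInt(i + 1);   // uniform index in [0, i]
                if (j < k) sample[j] = data[i]; // replace with probability k/(i+1)
            }
        }
        return sample;
    }

    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double[] data = new double[100000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1; // 1..100000, true mean 50000.5
        double est = mean(reservoirSample(data, 5000, 42L));
        System.out.println("estimated mean = " + est + " (true mean = 50000.5)");
    }
}
```

With 5,000 of 100,000 values sampled, the standard error of the mean estimate is a few hundred, so the estimate lands close to 50000.5, as Miles suggests.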




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: reading input for a map function from 2 different files?

Posted by Joel Welling <we...@psc.edu>.
Amar, isn't there a problem with your method, in that it gets a small
result by subtracting very large numbers?  Given a million inputs, won't
A and B be so much larger than the standard deviation that there aren't
enough bits left in the floating-point number to represent it?

I just thought I should mention that, before this thread goes in an
archive somewhere and some student looks it up.

-Joel
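
Joel's concern is easy to demonstrate in plain Java (this illustration is mine, not code from the thread). With values near 1e9, the one-pass sum-of-squares formula cancels away almost all significant bits, while a numerically stable method — here Welford's online algorithm, which nobody in this thread mentions but which addresses exactly this problem — recovers the true variance of 1.25.

```java
public class VarianceDemo {
    // One-pass "sum of squares" variance: (sum(x^2) - N*avg^2) / N.
    // Subtracts two huge, nearly equal numbers, so precision is lost.
    public static double naiveVariance(double[] xs) {
        double sum = 0, sumSq = 0;
        for (double x : xs) { sum += x; sumSq += x * x; }
        double avg = sum / xs.length;
        return (sumSq - xs.length * avg * avg) / xs.length;
    }

    // Welford's online algorithm: also one pass, but numerically stable,
    // because it only ever works with deviations from the running mean.
    public static double welfordVariance(double[] xs) {
        double mean = 0, m2 = 0;
        int n = 0;
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return m2 / n; // population variance
    }

    public static void main(String[] args) {
        double[] xs = {1e9, 1e9 + 1, 1e9 + 2, 1e9 + 3}; // true variance = 1.25
        System.out.println("naive   : " + naiveVariance(xs));   // wildly wrong
        System.out.println("welford : " + welfordVariance(xs)); // 1.25
    }
}
```

Here sum(x^2) is about 4e18, where a double's spacing is already hundreds of units, so the tiny true variance is lost entirely in the subtraction.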


Re: reading input for a map function from 2 different files?

Posted by Amar Kamat <am...@yahoo-inc.com>.
std_dev^2 = sum_i((Xi - Xa)^2) / N, where Xa is the average.
Why don't you use this formula to compute it in one MR job:
std_dev^2 = (sum_i(Xi^2) - N * (Xa^2)) / N
          = (A - N*(avg^2)) / N

For this your map would look like
   map(key, val) : output.collect(key^2, key); // imagine your input as
(k,v) = (Xi, null)
The reduce should simply sum over the keys to find 'sum_i(Xi^2)' and
sum over the values to find 'Xa'. You could use the close() API to
finally dump these 2 values to a file.

For example:
input : 1,2,3,4
Say the input is split into 2 groups, [1,2] and [3,4].
There will then be 2 maps with output as follows:
map1 output : (1,1) (4,2)
map2 output : (9,3) (16,4)

The reducer maintains the sum over all keys and all values:
A = sum(keys, i.e. inputs squared) = 1 + 4 + 9 + 16 = 30
B = sum(values, i.e. inputs) = 1 + 2 + 3 + 4 = 10

With A and B you can compute the standard deviation offline.
avg = B / N = 10/4 = 2.5
Hence the std deviation is
sqrt((A - N * avg^2) / N) = sqrt((30 - 4*6.25)/4) = 1.11803399

Using the original formula the answer is also 1.11803399.
Amar
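
Amar's two formulas can be checked with a few lines of plain Java (my own illustration, not code from the thread): the definition and the rearranged A/B form give the same result on the 1,2,3,4 example.

```java
public class StdDev {
    // Definition: sqrt(sum((xi - avg)^2) / N), two passes over the data.
    public static double twoPass(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        double avg = sum / xs.length;
        double sqDev = 0;
        for (double x : xs) sqDev += (x - avg) * (x - avg);
        return Math.sqrt(sqDev / xs.length);
    }

    // Rearranged form: A = sum(xi^2), B = sum(xi), result = sqrt((A - N*avg^2) / N).
    public static double onePass(double[] xs) {
        double a = 0, b = 0;
        for (double x : xs) { a += x * x; b += x; }
        int n = xs.length;
        double avg = b / n;
        return Math.sqrt((a - n * avg * avg) / n);
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        System.out.println("two-pass: " + twoPass(xs)); // ~1.11803399, as in the hand calculation
        System.out.println("one-pass: " + onePass(xs)); // same value
    }
}
```

(As Joel points out downthread, the rearranged form trades a second pass for numerical risk on large inputs; on this toy example both are exact.)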


Re: reading input for a map function from 2 different files?

Posted by Milind Bhandarkar <mi...@yahoo-inc.com>.
Since you need to pass only one number (the average) to all mappers, you can
pass it through the jobconf with a config variable defined by you, say
"my.average".

- milind
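
A sketch of how this fits together with the old (0.18-era) JobConf API — including the question raised upthread about getting the reducer's output into the second job's configuration: the driver reads job 1's single output record back from HDFS after it completes, sets "my.average", and job 2's mapper retrieves it in configure(). The class names and the output path below are my own placeholders, not from the thread.

```java
import java.io.*;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Driver: chain the two jobs by hand.
public class Driver {
    public static void main(String[] args) throws IOException {
        JobConf job1 = new JobConf(AverageJob.class); // placeholder job class
        // ... set job1 input/output paths, mapper, reducer ...
        JobClient.runJob(job1); // blocks until job 1 finishes

        // Read the lone "key<TAB>average" line the reducer wrote.
        FileSystem fs = FileSystem.get(job1);
        BufferedReader in = new BufferedReader(new InputStreamReader(
                fs.open(new Path("avg-out/part-00000")))); // placeholder path
        String average = in.readLine().split("\t")[1];
        in.close();

        JobConf job2 = new JobConf(StdDevJob.class); // placeholder job class
        job2.set("my.average", average); // the config variable Milind suggests
        // ... set job2 input/output paths, mapper, reducer ...
        JobClient.runJob(job2);
    }
}

// Mapper of job 2: retrieve the value once, in configure().
public class StdDevMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    private double average;

    public void configure(JobConf conf) {
        average = Double.parseDouble(conf.get("my.average", "0"));
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        double x = Double.parseDouble(value.toString());
        output.collect(new Text("sqdev"),
                       new DoubleWritable((x - average) * (x - average)));
    }
}
```

The same pattern works with setInt/getInt (mentioned upthread) when the value is integral; plain set/get with string parsing is the lowest common denominator.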



-- 
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136 
(milindb@yahoo-inc.com)


Re: reading input for a map function from 2 different files?

Posted by some speed <sp...@gmail.com>.
Thanks for the response. What I am trying to do is find the average
and then the standard deviation of a very large set (say a million) of
numbers. The result would be used in further calculations.
I have the average from the first MapReduce job; now I need to read
this average as well as the set of numbers to calculate the standard
deviation. So one file would have the input set, and the other,
"resultant", file would have just the average.
Please do tell me if there is a better way of doing things than what I
am doing. Any input/suggestion is appreciated. :)




Re: reading input for a map function from 2 different files?

Posted by Amar Kamat <am...@yahoo-inc.com>.
Amar Kamat wrote:
> some speed wrote:
>> I was wondering if it was possible to read the input for a map 
>> function from
>> 2 different files:
>>   1st file ---> user-input file from a particular location(path)
Is the input/user file sorted? If yes then you can use "map-side join" 
for performance reasons. See org.apache.hadoop.mapred.join for more 
details.
>> 2nd file ---> A resultant file (has just one <key,value> pair) from a
>> previous MapReduce job. (I am implementing a chain MapReduce function)
Can you explain in more detail the contents of 2nd file?
>>
>> Now, for every <key,value> pair in the user-input file, I would like 
>> to use
>> the same <key,value> pair from the 2nd file for some calculations.
Can you explain this in more detail? Can you give some abstracted 
example of how file1 and file2 look like and what operation/processing 
you want to do?
>>   
> I guess you might need to do some kind of join on the 2 files. Look at 
> contrib/data_join for more details.
> Amar
>> Is it possible for me to do so? Can someone guide me in the right 
>> direction
>> please?
>>
>>
>> Thanks!
>>
>>   
>
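
For the sorted-input case Amar mentions, the map-side join wiring looks roughly like this under the old API (a sketch under my assumptions — both inputs must be sorted by key and partitioned identically, and the paths are placeholders of mine):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

// Map-side join (org.apache.hadoop.mapred.join), 0.18-era API sketch.
JobConf conf = new JobConf(JoinJob.class); // placeholder job class
conf.setInputFormat(CompositeInputFormat.class);
// "inner" join of the two sorted inputs; each map call then receives a
// TupleWritable holding the matching values from both files for a key.
conf.set("mapred.join.expr",
         CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                                      new Path("user-input"), new Path("avg-out")));
```

This avoids a reduce-side join entirely, which is why Amar suggests it for performance — but only when the sortedness/partitioning preconditions hold.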


Re: reading input for a map function from 2 different files?

Posted by Amar Kamat <am...@yahoo-inc.com>.
some speed wrote:
> I was wondering if it was possible to read the input for a map function from
> 2 different files:
>   
> 1st file ---> user-input file from a particular location(path)
> 2nd file ---> A resultant file (has just one <key,value> pair) from a
> previous MapReduce job. (I am implementing a chain MapReduce function)
>
> Now, for every <key,value> pair in the user-input file, I would like to use
> the same <key,value> pair from the 2nd file for some calculations.
>   
I guess you might need to do some kind of join on the 2 files. Look at 
contrib/data_join for more details.
Amar
> Is it possible for me to do so? Can someone guide me in the right direction
> please?
>
>
> Thanks!
>
>   


Re: reading input for a map function from 2 different files?

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
some speed wrote:
> I was wondering if it was possible to read the input for a map function from
> 2 different files:
>
> 1st file ---> user-input file from a particular location(path)
> 2nd file ---> A resultant file (has just one <key,value> pair) from a
> previous MapReduce job. (I am implementing a chain MapReduce function)
>
> Now, for every <key,value> pair in the user-input file, I would like to use
> the same <key,value> pair from the 2nd file for some calculations.
>
>   
I think you can use DistributedCache to distribute your second file
among the maps.
There is more documentation at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

Thanks
Amareshwari
> Is it possible for me to do so? Can someone guide me in the right direction
> please?
>
>
> Thanks!
>
>
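
The DistributedCache route can be sketched with the old API roughly as follows (the path and class names are placeholders of mine, not from the thread): the driver registers the small "average" file, and each map task reads its local copy once in configure().

```java
import java.io.*;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Driver: ship the small "average" file to every map task's local disk.
// JobConf conf = new JobConf(StdDevJob.class);                       // placeholder job class
// DistributedCache.addCacheFile(
//         new URI("/user/me/avg-out/part-00000"), conf);            // placeholder path

// Mapper: read the cached copy once, before any map() calls.
public class StdDevMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    private double average;

    public void configure(JobConf job) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            BufferedReader in =
                    new BufferedReader(new FileReader(cached[0].toString()));
            average = Double.parseDouble(in.readLine().split("\t")[1]); // "key<TAB>value"
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("could not read cached average", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        // 'average' is now available for every input record.
    }
}
```

Compared with passing the number through the jobconf, this scales to second files that hold more than one value.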