Posted to common-user@hadoop.apache.org by Sid123 <it...@gmail.com> on 2009/03/27 22:19:24 UTC

Iterative feedback in map reduce....

Hi,
I have to design an iterative algorithm where each iteration is an M-R cycle
that calculates a parameter and has to feed it back to all the maps in the
next iteration.
In the reduce procedure I just need to sum everything from the map
procedure (many matrices of the same size) into a single matrix (of the same
size as each input), irrespective of the key. This single matrix is the
parameter I was talking about earlier. PS: This parameter MUST BE global to
all map processes.

1) How do I collect all the values into one single parameter? Do I need to
write it to the file system, or can I keep it in memory? I feel that I WILL
have to write it to HDFS somewhere...
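A minimal sketch of this feedback loop, in plain Python with no Hadoop
dependency (all names here are hypothetical illustrations, not Hadoop API
calls): each pass sums the per-map matrices into one parameter matrix, which
in a real job would be written to HDFS and re-read by every mapper at the
start of the next pass.

```python
def elementwise_sum(a, b):
    """Sum two same-size matrices cell by cell."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def reduce_phase(matrices):
    """Sum all map-output matrices into one matrix, ignoring keys."""
    total = matrices[0]
    for m in matrices[1:]:
        total = elementwise_sum(total, m)
    return total

def run_iterations(initial_param, data, map_fn, n_iter):
    """Driver loop: run one M-R cycle, then feed the summed matrix back."""
    param = initial_param
    for _ in range(n_iter):
        map_outputs = [map_fn(record, param) for record in data]  # map phase
        param = reduce_phase(map_outputs)                         # reduce phase
        # In a real job: write `param` to HDFS here, and distribute it to
        # the mappers of the next iteration (e.g. as a side file).
    return param
```

The point of the sketch is that the "global" parameter lives in the driver
between jobs, not inside any one map or reduce task.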
-- 
View this message in context: http://www.nabble.com/Iterative-feedback-in-map-reduce....-tp22748317p22748317.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Iterative feedback in map reduce....

Posted by Kevin Peterson <kp...@biz360.com>.
On Fri, Mar 27, 2009 at 4:39 PM, Sid123 <it...@gmail.com> wrote:

> But I was thinking of grouping the values and generating a key using a
> random number generator in the collector of the mapper. The values will now
> be uniformly distributed over a few keys. Say the number of keys will be
> 0.1% of the # of values, or at least 1, whichever is higher. So if there
> are 20000 values, 2000-odd values should be under a single key... and 10
> reducers should spawn to do the sum in parallel... Now I can at least run
> 10 sums in parallel rather than just 1 reducer doing the whole work... How
> does that theory seem?
>

What you want to do is write a combiner, which is essentially a reducer that
runs on the map output of a single node before it is sent to the main
reducer. The real reducer would then get roughly one value per node.
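A sketch of that idea in plain Python (hypothetical names; in real Hadoop
the combiner is a class registered on the job, and when the operation is
associative and commutative, as matrix addition is, it is usually the same
class as the reducer):

```python
def matrix_sum(a, b):
    """Cell-by-cell sum of two same-size matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def combine(key, matrices):
    """Combiner: runs on one node's map output before the shuffle.
    It collapses that node's matrices into a single partial sum, so the
    reducer receives one matrix per node instead of one per map call."""
    total = matrices[0]
    for m in matrices[1:]:
        total = matrix_sum(total, m)
    return key, total

def reduce_final(key, partial_sums):
    """Reducer: sums the per-node partial sums into the global matrix.
    Same logic as the combiner, which is why reusing one class is safe."""
    total = partial_sums[0]
    for m in partial_sums[1:]:
        total = matrix_sum(total, m)
    return total
```

With a combiner in place, the single-reducer bottleneck only has to add a
handful of pre-summed matrices rather than every map output.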

Re: Iterative feedback in map reduce....

Posted by Sid123 <it...@gmail.com>.
Thanks for the help, Peter... Looks like the mapper is writing out to a
common key and adding all the values to HDFS. The mapper(s) will just
serialize over one another to write to the disc... I will be writing the
code for this tonight... So can you answer a tech question: since all the
values are being grouped under a common key, how many reduce processes do
you think will be spawned? I am thinking 1, which is bad...
But I was thinking of grouping the values and generating a key using a
random number generator in the collector of the mapper. The values will now
be uniformly distributed over a few keys. Say the number of keys will be
0.1% of the # of values, or at least 1, whichever is higher. So if there
are 20000 values, 2000-odd values should be under a single key... and 10
reducers should spawn to do the sum in parallel... Now I can at least run
10 sums in parallel rather than just 1 reducer doing the whole work... How
does that theory seem?
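The random-key scheme can be sketched as follows (plain Python, hypothetical
names; it follows the stated 0.1%-of-values rule, and the per-key partial
sums would still need one final summing step, e.g. a second small job):

```python
import random

def random_key(num_values, rng):
    """Pick one of ~0.1% * num_values keys (at least 1) uniformly at
    random, so the values spread evenly over a handful of reducers."""
    num_keys = max(1, num_values // 1000)  # 0.1% of the value count
    return rng.randrange(num_keys)

def partition(values, rng):
    """Group values under random keys, as the mapper's collector would;
    each resulting bucket can then be summed by its own reducer."""
    buckets = {}
    for v in values:
        buckets.setdefault(random_key(len(values), rng), []).append(v)
    return buckets
```

Note that this is exactly the work a combiner does for free, without the
extra pass to merge the per-key partial sums afterwards.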


Peter Skomoroch wrote:
> 
> Check out the EM example in nltk:
> 
> http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py

-- 
View this message in context: http://www.nabble.com/Iterative-feedback-in-map-reduce....-tp22748317p22751900.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Iterative feedback in map reduce....

Posted by Peter Skomoroch <pe...@gmail.com>.
Check out the EM example in nltk:

http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch