Posted to common-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2010/10/28 17:53:04 UTC

Statistics and Early Keys to Reducers

Imagine I have the following problem: I want to run a standard word count
program, but instead of having the reducer output each word and
its count, I want it to output the word and count / (total count of words
of that length).

The total count of words of each length (say 1..100) seen by each mapper
is known at the end of the map step.

In theory each mapper could send its totals to every reducer, and before the
rest of the reduce step each reducer could
compute the grand totals.

This requires that:
1) Statistics are sent with a key which sorts ahead of all others
2) Statistics are sent as the mapper is closing
3) Somehow each mapper sends statistics with proper keys so a copy is
delivered to every reducer
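
To make the three requirements concrete, here is a minimal sketch in plain
Java that simulates the idea without any Hadoop classes. Every name in it
(EarlyStatsSketch, STATS_PREFIX, statsKey, partition, emit, run) is
hypothetical, and it assumes that no real word starts with the NUL character.
In an actual job the routing would presumably live in a custom Partitioner and
the emission in the mapper's close hook; this only illustrates the key scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the proposal; not Hadoop code.
final class EarlyStatsSketch {
    // Hypothetical prefix. '\u0000' compares lower than any ordinary character,
    // so stats keys sort ahead of all word keys (requirement 1).
    static final String STATS_PREFIX = "\u0000STATS";

    // Stats key layout (an assumption): prefix:targetReducer:wordLength
    static String statsKey(int reducer, int wordLength) {
        return STATS_PREFIX + ":" + reducer + ":" + wordLength;
    }

    static boolean isStatsKey(String key) {
        return key.startsWith(STATS_PREFIX);
    }

    // Stats keys go to the reducer named inside the key (requirement 3);
    // ordinary words are hash-partitioned.
    static int partition(String key, int numReducers) {
        if (isStatsKey(key)) {
            return Integer.parseInt(key.split(":")[1]);
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Route one record to its reducer; the TreeMap stands in for the sort.
    static void emit(List<TreeMap<String, List<Long>>> shuffled,
                     String key, long value, int numReducers) {
        shuffled.get(partition(key, numReducers))
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(value);
    }

    // Each inner list is one mapper's input; returns word -> count / total.
    static Map<String, Double> run(List<List<String>> splits, int numReducers) {
        List<TreeMap<String, List<Long>>> shuffled = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            shuffled.add(new TreeMap<>());
        }

        for (List<String> split : splits) {
            Map<Integer, Long> totalsByLength = new HashMap<>();
            for (String word : split) {
                emit(shuffled, word, 1L, numReducers);
                totalsByLength.merge(word.length(), 1L, Long::sum);
            }
            // "Mapper close" (requirement 2): send every per-length total
            // to every reducer.
            for (Map.Entry<Integer, Long> e : totalsByLength.entrySet()) {
                for (int r = 0; r < numReducers; r++) {
                    emit(shuffled, statsKey(r, e.getKey()), e.getValue(), numReducers);
                }
            }
        }

        Map<String, Double> result = new TreeMap<>();
        for (TreeMap<String, List<Long>> input : shuffled) {
            // Stats keys arrive first, so the grand totals are complete
            // before the first word key is seen.
            Map<Integer, Long> grandTotals = new HashMap<>();
            for (Map.Entry<String, List<Long>> e : input.entrySet()) {
                long sum = 0;
                for (long v : e.getValue()) {
                    sum += v;
                }
                if (isStatsKey(e.getKey())) {
                    int length = Integer.parseInt(e.getKey().split(":")[2]);
                    grandTotals.merge(length, sum, Long::sum);
                } else {
                    long total = grandTotals.get(e.getKey().length());
                    result.put(e.getKey(), (double) sum / total);
                }
            }
        }
        return result;
    }
}
```

The cost of this scheme is one extra record per (mapper, reducer, word length),
which stays small as long as the number of distinct lengths is bounded.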

Is this a reasonable approach? Are there others?
What do folks think?
-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA