Posted to common-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2010/10/28 17:53:04 UTC
Statistics and Early Keys to Reducers
Imagine I have the following problem: I want to run a standard word count
program, but instead of having the reducer output each word and
its count, I want it to output the word and count / (total count of words
of that length).
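
To make the target output concrete, here is a minimal single-process sketch in Python (no Hadoop involved). The function name normalized_counts and the sample words are made up for illustration, and it assumes "total count of words of that length" means total occurrences rather than distinct words.

```python
from collections import Counter

def normalized_counts(words):
    """Return {word: count / total occurrences of words with the same length}."""
    word_counts = Counter(words)
    # Assumption: the denominator counts occurrences, not distinct words.
    length_totals = Counter()
    for word, count in word_counts.items():
        length_totals[len(word)] += count
    return {w: c / length_totals[len(w)] for w, c in word_counts.items()}

# Hypothetical sample input: three 1-letter and four 3-letter occurrences.
result = normalized_counts(["a", "a", "b", "cat", "dog", "cat", "cat"])
```

On a single machine the per-length totals are available before any ratio is computed; the difficulty in MapReduce is that no single reducer sees all the words.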
The total count of words of a given length (say 1..100) seen by each mapper
is known at the end of the map step.
In theory, each mapper could send its totals to every reducer, and each
reducer could compute the grand total before the rest of the reduce step
begins.
This requires that:
1) Statistics are sent with a key which sorts ahead of all others
2) Statistics are sent as the mapper is closing
3) Each mapper somehow sends statistics with suitable keys so that a copy is
delivered to every reducer
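
The three requirements can be sketched as a small in-memory simulation in Python. This is not Hadoop code: NUM_REDUCERS, the STATS/WORD key tags, and the run_mapper, partition, run_reducer and run_job helpers are all hypothetical names chosen for illustration. In a real job the routing in partition() would presumably live in a custom partitioner, and the statistics would be emitted from the mapper's close hook.

```python
from collections import Counter, defaultdict
from itertools import groupby
import zlib

NUM_REDUCERS = 3      # hypothetical number of reduce tasks
STATS, WORD = 0, 1    # key tags; STATS (0) sorts ahead of WORD (1)

def run_mapper(words):
    """Emit (key, value) pairs for one input split."""
    length_totals = Counter()
    for word in words:
        length_totals[len(word)] += 1
        yield (WORD, word), 1
    # Requirements 2 and 3: at "close" time, emit one copy of every
    # per-length total for every reducer, with the reducer id in the key.
    for reducer in range(NUM_REDUCERS):
        for length, total in length_totals.items():
            yield (STATS, reducer, length), total

def partition(key):
    """Route statistics by the reducer id in the key, words by hash."""
    if key[0] == STATS:
        return key[1]
    return zlib.crc32(key[1].encode("utf-8")) % NUM_REDUCERS

def run_reducer(records):
    """Process records in key order; statistics keys arrive first."""
    grand_totals = Counter()
    output = {}
    for key, group in groupby(sorted(records), key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        if key[0] == STATS:
            # Requirement 1: all totals are summed before any word is seen.
            grand_totals[key[2]] += total
        else:
            word = key[1]
            output[word] = total / grand_totals[len(word)]
    return output

def run_job(splits):
    """Simulate map, shuffle and reduce for a list of input splits."""
    buckets = defaultdict(list)
    for split in splits:
        for key, value in run_mapper(split):
            buckets[partition(key)].append((key, value))
    result = {}
    for records in buckets.values():
        result.update(run_reducer(records))
    return result

# Two hypothetical input splits, one per mapper.
result = run_job([["a", "a", "cat"], ["b", "dog", "cat", "cat"]])
```

The reducer divides as soon as it reaches a word key, so it only works if every statistics key has already been consumed. That is exactly what the "sorts ahead of all others" requirement buys. The cost is that each mapper sends (number of reducers) x (number of distinct lengths) extra records.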
Is this a reasonable approach? Are there others?
What do folks think?
--
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA