Posted to common-user@hadoop.apache.org by Antonio Piccolboni <an...@piccolboni.info> on 2010/11/03 07:47:19 UTC

Re: Statistics and Early Keys to Reducers

You are forgoing the use of a combiner in your solution, which makes it
non-scalable. Imagine what happens, without a combiner, when all or most
words have the same length: most of the computation goes through one
reducer while the others watch -- you with them. You can fix it by having
the mapper emit (word, 1, 1) (where word can take the special value #n for
the overall count of length-n words) and the reducer emit
(word, wordCount / statistics[k2[0]], statistics[k2[0]]).
Then you have to turn the sum into a weighted sum ... I don't even want to
go there. Just do three separate MapReduces and you'll be much better off,
with reusable modules. The first MapReduce is the standard word count
algorithm. The second maps (word, count) to (length, count) and reduces to
(length, sum(counts)); again you can use a combiner, and the data set is
smaller to start with. For the final MapReduce you distribute the
per-length counts to every mapper with the distributed cache and map
(word, count) to (word, count / lengthCount[word.length]).
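
For concreteness, here is a rough Java sketch of jobs two and three against
the org.apache.hadoop.mapreduce API (job one is the stock word count; the
class names, and the assumption that each job reads the previous one's
"word<TAB>count" text output, are mine, not anything standard):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LengthNormalize {

      // Job 2 mapper: "word<TAB>count" -> (word length, count)
      public static class LengthMapper
          extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        @Override
        public void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] p = line.toString().split("\t");
          ctx.write(new IntWritable(p[0].length()),
                    new LongWritable(Long.parseLong(p[1])));
        }
      }

      // Job 2 reducer, also usable as its combiner: length -> sum(counts)
      public static class SumReducer
          extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        public void reduce(IntWritable len, Iterable<LongWritable> counts,
                           Context ctx) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) sum += c.get();
          ctx.write(len, new LongWritable(sum));
        }
      }

      // Job 3 mapper (map-only): load the small length -> total table from
      // the distributed cache in setup(), then emit word -> count / total.
      public static class NormalizeMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Map<Integer, Long> totals = new HashMap<>();

        @Override
        protected void setup(Context ctx) throws IOException {
          // "length-totals.txt" is a made-up local name for job 2's output,
          // shipped to each task via the distributed cache.
          try (BufferedReader in =
                   new BufferedReader(new FileReader("length-totals.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
              String[] p = s.split("\t");
              totals.put(Integer.parseInt(p[0]), Long.parseLong(p[1]));
            }
          }
        }

        @Override
        public void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] p = line.toString().split("\t");
          double count = Long.parseLong(p[1]);
          ctx.write(new Text(p[0]),
                    new DoubleWritable(count / totals.get(p[0].length())));
        }
      }
    }

SumReducer doubles as the combiner for job two, and job three is map-only,
so the only full shuffle over the raw corpus is the word count itself.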


Antonio

On Thu, Oct 28, 2010 at 10:03 AM, Ricky Ho <ri...@yahoo.com> wrote:

> Of course you can use two rounds of MapReduce, with the first round
> computing the statistics and the second round computing the percentile.
>
> But I don't think that is better than your solution ... which is the best
> one I can think of. Here is the pseudo code of your solution ...
>
>
>
> map(k1, doc) {
>     for each word in doc {
>         # one record for the per-length total, one for the word itself
>         emit([word.length, "#"], 1)
>         emit([word.length, word], 1)
>     }
> }
>
> # all keys of a given length go to the same reducer
> partition(k2) {
>     return k2[0] % NoOfReducers
> }
>
> # key = word length, value = total count of words of that length.
> # This works because "#" sorts ahead of every word, so each reducer
> # sees the length total before any of the words of that length.
> statistics = Hash.new(0)
>
> reduce(k2, listOfCounts) {
>     if k2[1] == "#" {
>         for each count in listOfCounts {
>             statistics[k2[0]] += count
>         }
>     } else {
>         wordCount = 0
>         for each count in listOfCounts {
>             wordCount += count
>         }
>         emit(k2[1], wordCount / statistics[k2[0]])
>     }
> }
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Steve Lewis [mailto:lordjoe2000@gmail.com]
> Sent: Thursday, October 28, 2010 8:53 AM
> To: common-user
> Subject: Statistics and Early Keys to Reducers
>
> Imagine I have the following problem - I want to call a standard word
> count program, but instead of having the reducer output the word and
> its count, I want it to output the word and
> count / (total count of words of that length).
>
> The total count of words of a given length - say 1..100 - seen by each
> mapper is known at the end of the map step.
>
> In theory each mapper could send its total to every reducer, and before
> the rest of the reduce step each reducer could compute the grand total.
>
> This requires that
> 1) statistics are sent with a key which sorts ahead of all others,
> 2) statistics are sent as the mapper is closing, and
> 3) each mapper somehow sends statistics with the proper keys so that a
> copy is delivered to every reducer.
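>
> Concretely, I imagine the mapper looking something like this rough
> sketch (Java; the class name and the "#length:reducer" key convention
> are just placeholders, and it assumes a custom partitioner that routes
> a stats key "#len:r" to reducer r):
>
>     import java.io.IOException;
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     public class StatsWordCountMapper
>         extends Mapper<LongWritable, Text, Text, LongWritable> {
>       private final Map<Integer, Long> lengthTotals = new HashMap<>();
>
>       @Override
>       public void map(LongWritable offset, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         for (String word : line.toString().split("\\s+")) {
>           if (word.isEmpty()) continue;
>           ctx.write(new Text(word), new LongWritable(1));
>           lengthTotals.merge(word.length(), 1L, Long::sum);
>         }
>       }
>
>       // Runs as the mapper is closing: fan this mapper's per-length
>       // totals out to every reducer. "#" sorts ahead of any letter, so
>       // these keys arrive before the real words at each reducer.
>       @Override
>       protected void cleanup(Context ctx)
>           throws IOException, InterruptedException {
>         int reducers = ctx.getNumReduceTasks();
>         for (Map.Entry<Integer, Long> e : lengthTotals.entrySet()) {
>           for (int r = 0; r < reducers; r++) {
>             ctx.write(new Text("#" + e.getKey() + ":" + r),
>                       new LongWritable(e.getValue()));
>           }
>         }
>       }
>     }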
>
> Is this a reasonable approach - are there others?
> What do folks think?
> --
> Steven M. Lewis PhD
> 4221 105th Ave Ne
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Institute for Systems Biology
> Seattle WA