Posted to mapreduce-user@hadoop.apache.org by Torsten Curdt <tc...@vafer.org> on 2010/06/04 17:28:02 UTC

cumulative counts over time

Hey folks,

I have the following keys/lines as input

 2010-03-01 11:56/A -> 1
 2010-03-01 11:57/A -> 1
 2010-03-01 11:57/A -> 1
 2010-03-01 11:57/B -> 1
 2010-03-01 11:58/B -> 1
 2010-03-01 11:58/A -> 1
 2010-03-01 11:59/A -> 1

For each of these lines I do one emit. As in the word count example,
I can just sum them in the reduce phase to get the totals per minute:

 2010-03-01 11:56/A -> 1
 2010-03-01 11:57/A -> 2
 2010-03-01 11:57/B -> 1
 2010-03-01 11:58/B -> 1
 2010-03-01 11:58/A -> 1
 2010-03-01 11:59/A -> 1
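
To pin down what this step computes, here is a minimal single-process
sketch in Python (not Hadoop code). The names `records` and
`count_per_minute` are made up for illustration; the data is the
seven input lines from above.

```python
from collections import Counter

# The seven input records from the example, as (minute, key) pairs.
records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]


def count_per_minute(records):
    """Emulate map (emit 1 per record) and reduce (sum per key)."""
    totals = Counter()
    for minute, key in records:
        # "Map" emits ((minute, key), 1); "reduce" adds the ones up.
        totals[(minute, key)] += 1
    return dict(totals)


per_minute = count_per_minute(records)
for (minute, key), count in sorted(per_minute.items()):
    print(f"{minute}/{key} -> {count}")
```

The sum is associative, so in a real job the same reducer could
presumably double as a combiner.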

Great. Now I know that in minute 2010-03-01 11:57 A had 2 emits. What
I would also like to have, though, is the cumulative total per key
since the start of the mapreduce range:

 2010-03-01 11:56/A -> 1,1
 2010-03-01 11:57/A -> 2,3
 2010-03-01 11:57/B -> 1,1
 2010-03-01 11:58/B -> 1,2
 2010-03-01 11:58/A -> 1,4
 2010-03-01 11:59/A -> 1,5

So at 2010-03-01 11:58 A had 1 emit but a total of 4 emits since
2010-03-01 11:56.
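
Stated sequentially, the desired output is just a running sum per key
over the per-minute totals. The sketch below is plain Python on one
machine, only meant to define the expected result; it is not a
distributed solution. `per_minute` and `cumulative_totals` are
hypothetical names, and the input is the per-minute table from above.

```python
from collections import defaultdict

# Per-minute totals from the reduce step, keyed by (minute, key).
per_minute = {
    ("2010-03-01 11:56", "A"): 1,
    ("2010-03-01 11:57", "A"): 2,
    ("2010-03-01 11:57", "B"): 1,
    ("2010-03-01 11:58", "B"): 1,
    ("2010-03-01 11:58", "A"): 1,
    ("2010-03-01 11:59", "A"): 1,
}


def cumulative_totals(per_minute):
    """Return {(minute, key): (count, running_total_for_key)}."""
    running = defaultdict(int)
    result = {}
    # Zero-padded "YYYY-MM-DD HH:MM" strings sort chronologically,
    # so a plain sort visits the minutes in time order.
    for (minute, key), count in sorted(per_minute.items()):
        running[key] += count
        result[(minute, key)] = (count, running[key])
    return result


totals = cumulative_totals(per_minute)
for (minute, key), (count, total) in sorted(totals.items()):
    print(f"{minute}/{key} -> {count},{total}")
```

The running sum depends on seeing each key's minutes in order, which
is exactly what is hard to guarantee across independent reducers.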

The only way I could think of to solve this in a distributed context
is to emit each record not just for its own minute but also for every
later minute up to the end of the mapreduce range, and then sum these
in the reduce phase:

 2010-03-01 11:56/A -> 1
  2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/B -> 1
  2010-03-01 11:58/B -> 1
  2010-03-01 11:59/B -> 1
 2010-03-01 11:58/B -> 1
  2010-03-01 11:59/B -> 1
 2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:59/A -> 1

But for longer time ranges this leads to an explosion of emits.
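
To make the cost concrete, here is a small Python sketch of the
"emit for the future" idea on the example data. It is a single-process
emulation, not Hadoop code; `emit_with_future`, `FMT` and the
hard-coded range end are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]


def emit_with_future(minute, key, range_end):
    """Yield ((m, key), 1) for the record's minute and each later
    minute m up to and including range_end."""
    t = datetime.strptime(minute, FMT)
    end = datetime.strptime(range_end, FMT)
    while t <= end:
        yield (t.strftime(FMT), key), 1
        t += timedelta(minutes=1)


# "Map" phase: every record fans out to all remaining minutes.
emits = [
    pair
    for minute, key in records
    for pair in emit_with_future(minute, key, "2010-03-01 11:59")
]

# "Reduce" phase: summing per (minute, key) gives the cumulative total.
cumulative = Counter()
for k, v in emits:
    cumulative[k] += v

print(len(records), "records ->", len(emits), "emits")
```

Each record produces one emit per remaining minute, so the number of
emits grows roughly with records times range length. This variant also
produces a total for minutes where a key had no input at all (B at
11:59 here), which would have to be filtered out if unwanted.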

Could anyone think of a better way of doing this?

cheers
--
Torsten