Posted to mapreduce-user@hadoop.apache.org by Torsten Curdt <tc...@vafer.org> on 2010/06/04 17:28:02 UTC
cumulative counts over time
Hey folks,
I have the following keys/lines as input
2010-03-01 11:56/A -> 1
2010-03-01 11:57/A -> 1
2010-03-01 11:57/A -> 1
2010-03-01 11:57/B -> 1
2010-03-01 11:58/B -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
For each of these lines I do one emit. Similar to the word count
example I can just add them in the reduce phase to get the totals:
2010-03-01 11:56/A -> 1
2010-03-01 11:57/A -> 2
2010-03-01 11:57/B -> 1
2010-03-01 11:58/B -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
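As a minimal sketch of this step, here is the same thing in plain Python standing in for the mapper and reducer. The names `map_record` and `reduce_counts` are made up for illustration and are not Hadoop API.

```python
from collections import Counter

# The example input above, one (minute, key) pair per line.
records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]

def map_record(minute, key):
    # One emit per input line, as in word count.
    yield (minute, key), 1

def reduce_counts(emits):
    # Sum the values of all emits that share the same (minute, key).
    totals = Counter()
    for k, v in emits:
        totals[k] += v
    return dict(totals)

emits = [e for minute, key in records for e in map_record(minute, key)]
per_minute = reduce_counts(emits)
```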
Great. Now I know that in minute 2010-03-01 11:57 A had 2 emits. What
I would also like to have, though, is the cumulative total per key,
counted from the start of the mapreduce range:
2010-03-01 11:56/A -> 1,1
2010-03-01 11:57/A -> 2,3
2010-03-01 11:57/B -> 1,1
2010-03-01 11:58/B -> 1,2
2010-03-01 11:58/A -> 1,4
2010-03-01 11:59/A -> 1,5
So at 2010-03-01 11:58 A had 1 emit but a total of 4 emits since
2010-03-01 11:56.
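To pin down what the second number means, here is a single-process reference in plain Python that reproduces the table above from the per-minute totals. It is only a sketch of the desired output, not a distributed solution, and `cumulate` is a hypothetical helper name.

```python
from collections import defaultdict

# Per-minute totals from the reduce phase, keyed by (minute, key).
per_minute = {
    ("2010-03-01 11:56", "A"): 1,
    ("2010-03-01 11:57", "A"): 2,
    ("2010-03-01 11:57", "B"): 1,
    ("2010-03-01 11:58", "B"): 1,
    ("2010-03-01 11:58", "A"): 1,
    ("2010-03-01 11:59", "A"): 1,
}

def cumulate(per_minute):
    # Walk the entries in time order and keep a running sum per key.
    # The fixed-width timestamp format sorts chronologically as a string.
    running = defaultdict(int)
    result = {}
    for minute, key in sorted(per_minute):
        count = per_minute[(minute, key)]
        running[key] += count
        result[(minute, key)] = (count, running[key])
    return result

cumulative = cumulate(per_minute)
for (minute, key), (count, total) in sorted(cumulative.items()):
    print(f"{minute}/{key} -> {count},{total}")
```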
The only way I could think of to solve this in a distributed context
is to also emit each line for every later minute up to the end of the
mapreduce range, and then sum these in the reduce phase:
2010-03-01 11:56/A -> 1
2010-03-01 11:57/A -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
2010-03-01 11:57/A -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
2010-03-01 11:57/A -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
2010-03-01 11:57/B -> 1
2010-03-01 11:58/B -> 1
2010-03-01 11:59/B -> 1
2010-03-01 11:58/B -> 1
2010-03-01 11:59/B -> 1
2010-03-01 11:58/A -> 1
2010-03-01 11:59/A -> 1
2010-03-01 11:59/A -> 1
But for longer time ranges this leads to an explosion of emits.
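A rough sketch of this emit-forward idea in plain Python, assuming one-minute buckets and a known end of range. `map_forward` and `RANGE_END` are illustrative names, not Hadoop API.

```python
from collections import Counter
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
RANGE_END = "2010-03-01 11:59"  # assumed end of the mapreduce range

records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]

def map_forward(minute, key, range_end=RANGE_END):
    # Emit 1 for the line's own minute and for every later minute
    # up to and including the end of the range.
    t = datetime.strptime(minute, FMT)
    end = datetime.strptime(range_end, FMT)
    while t <= end:
        yield (t.strftime(FMT), key), 1
        t += timedelta(minutes=1)

emits = [e for minute, key in records for e in map_forward(minute, key)]

# Reduce: summing per (minute, key) now yields the cumulative total.
cumulative = Counter()
for k, v in emits:
    cumulative[k] += v

print(len(records), "input lines ->", len(emits), "emits")
```

For the example this turns 7 input lines into 18 emits. Each line produces one emit per minute remaining in the range, so the number of emits grows with the number of lines times the length of the range.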
Could anyone think of a better way of doing this?
cheers
--
Torsten