Posted to common-user@hadoop.apache.org by Oleg Ruchovets <or...@gmail.com> on 2013/01/28 13:56:29 UTC

aggregation by time window

Hi ,
    I have raw data with the following structure:

event_id  |   time
==============
event1     |  10:07
event2     |  10:10
event3     |  10:12

event4     |   10:20
event5     |   10:23
event6     |   10:25

The number of records is 50-100 million.

Question:
   I need to get the events that occurred within a time window T.

For example, if T=7 minutes:
     event1 , event2 , event3 were detected within 7 minutes.
     event4 , event5 , event6 were detected within 7 minutes.

How can I implement such aggregation using map/reduce?

Thanks
Oleg.

Re: aggregation by time window

Posted by Oleg Ruchovets <or...@gmail.com>.
Hi Zhiwei,
    No :-). Every 7 minutes is easy: just divide the time in milliseconds
by 7*60000 and you get a bucket key.
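That fixed-bucket transform can be sketched in plain Python (a sketch only, not Hadoop code; representing the time as milliseconds since midnight is an assumption for the example):

```python
# Fixed 7-minute buckets: every event whose timestamp falls inside the
# same 7-minute span gets the same bucket key, via integer division.
WINDOW_MS = 7 * 60_000  # 7 minutes in milliseconds

def bucket_key(timestamp_ms):
    # Integer division collapses all timestamps in one window to one key.
    return timestamp_ms // WINDOW_MS

def hhmm_to_ms(hhmm):
    # Helper for the example: "HH:MM" -> milliseconds since midnight.
    h, m = hhmm.split(":")
    return (int(h) * 60 + int(m)) * 60_000

print(bucket_key(hhmm_to_ms("10:07")))  # 86
print(bucket_key(hhmm_to_ms("10:10")))  # 87
```

Note that 10:07 and 10:10 are only 3 minutes apart yet land in different fixed buckets, which is exactly why fixed bucketing does not answer the question about events near a given event.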

I need to do the following:
    Find the events which occurred within time T of a given event X.

In a very naive approach I would take the first event and find the other
events which happened within 7 minutes of its time. But I think that will
be very slow, and I am looking for a way to improve on this naive approach.

Thanks
Oleg.



On Mon, Jan 28, 2013 at 3:09 PM, Zhiwei Lin <zh...@gmail.com> wrote:

> do you mean every 7 mins?
> e.g, [10:07, 10:14),
>        [10:14, 10:21) .....
>
> On 28 January 2013 12:56, Oleg Ruchovets <or...@gmail.com> wrote:
>
> > Hi ,
> >     I have raw data with the following structure:
> >
> > event_id  |   time
> > ==============
> > event1     |  10:07
> > event2     |  10:10
> > event3     |  10:12
> >
> > event4     |   10:20
> > event5     |   10:23
> > event6     |   10:25
> >
> > The number of records is 50-100 million.
> >
> > Question:
> >    I need to get the events that occurred within a time window T.
> >
> > For example, if T=7 minutes:
> >      event1 , event2 , event3 were detected within 7 minutes.
> >      event4 , event5 , event6 were detected within 7 minutes.
> >
> > How can I implement such aggregation using map/reduce?
> >
> > Thanks
> > Oleg.
> >
>
>
>
> --
>
> Best wishes.
>
> Zhiwei
>

Re: aggregation by time window

Posted by Zhiwei Lin <zh...@gmail.com>.
do you mean every 7 mins?
e.g, [10:07, 10:14),
       [10:14, 10:21) .....

On 28 January 2013 12:56, Oleg Ruchovets <or...@gmail.com> wrote:

> Hi ,
>     I have raw data with the following structure:
>
> event_id  |   time
> ==============
> event1     |  10:07
> event2     |  10:10
> event3     |  10:12
>
> event4     |   10:20
> event5     |   10:23
> event6     |   10:25
>
> The number of records is 50-100 million.
>
> Question:
>    I need to get the events that occurred within a time window T.
>
> For example, if T=7 minutes:
>      event1 , event2 , event3 were detected within 7 minutes.
>      event4 , event5 , event6 were detected within 7 minutes.
>
> How can I implement such aggregation using map/reduce?
>
> Thanks
> Oleg.
>



-- 

Best wishes.

Zhiwei

Re: aggregation by time window

Posted by Oleg Ruchovets <or...@gmail.com>.
Well, that is much clearer, but I still have questions :-)

Suppose we have 3 map input records:

event1 | 10:07
event2 | 10:10
event3 | 10:12

Output from map(event1 | 10:07) will be:

 mapOutput(10:04:event1)
 mapOutput(10:05:event1)
 mapOutput(10:06:event1)
 mapOutput(10:07:event1)
 mapOutput(10:08:event1)
 mapOutput(10:09:event1)
 mapOutput(10:10:event1)

Output for map(event2 | 10:10) will be:

mapOutput(10:07:event2)
mapOutput(10:08:event2)
mapOutput(10:09:event2)
mapOutput(10:10:event2)
mapOutput(10:11:event2)
mapOutput(10:12:event2)
mapOutput(10:13:event2)

Output for map(event3 | 10:12) will be:

mapOutput(10:09:event3)
mapOutput(10:10:event3)
mapOutput(10:11:event3)
mapOutput(10:12:event3)
mapOutput(10:13:event3)
mapOutput(10:14:event3)
mapOutput(10:15:event3)

 Is it correct?

If yes,

in the reduce phase I will get these inputs:

reducer(10:04:event1)
reducer(10:05:event1)
reducer(10:06:event1)
reducer(10:07:event1 ,event2)
reducer(10:08:event1 , event2)
reducer(10:09:event1 , event2 , event3)
reducer(10:10:event1 , event2 , event3)
reducer(10:11:event2 , event3)
reducer(10:12:event2 , event3)
reducer(10:13:event2 , event3)
reducer(10:14:event3)
reducer(10:15:event3)

Iterating over each reducer input, how can I know at the end of the
aggregation which events occurred within 7 minutes of each other?

Thanks
Oleg.
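
The map emission and shuffle described above can be simulated in plain Python to check what each reduce() call actually receives (a sketch, not Hadoop code; all helper names here are made up for the illustration):

```python
from collections import defaultdict

def to_minutes(hhmm):
    """Parse 'HH:MM' into an absolute minute count."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def to_hhmm(minutes):
    """Format an absolute minute count back to 'HH:MM'."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

def map_phase(event_id, hhmm, half_window=3):
    """Emit (minute-key, event) for every minute within +/- half_window."""
    t = to_minutes(hhmm)
    for k in range(t - half_window, t + half_window + 1):
        yield to_hhmm(k), event_id

def shuffle(pairs):
    """Group map outputs by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
map_outputs = [kv for eid, t in events for kv in map_phase(eid, t)]
reduce_inputs = shuffle(map_outputs)

for key in sorted(reduce_inputs):
    print(key, reduce_inputs[key])
```

Running this shows that keys 10:09 and 10:10 collect all three events, and that keys 10:11 through 10:13 receive both event2 and event3, since event2's emissions extend to 10:13.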



On Mon, Jan 28, 2013 at 3:48 PM, Kai Voigt <k...@123.org> wrote:

> Hi again,
>
> the idea is that you emit every event multiple times. So your map input
> record (event1, 10:07) will be emitted seven times during the map() call.
> Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be
> the seven outputs for processing a single event.
>
> The output key will be the time stamp of the neighbourhood, or interval,
> in which each event should be joined with events that happened within
> +/- 3 minutes of it. So two events which happened within 7 minutes of
> each other will both be emitted with the same time stamp as a map()
> output, and thus meet in a reduce() call.
>
> A reduce() call will look like this: reduce(10:03, list_of_events). And
> those events had time stamps between 10:00 and 10:06 in the original input.
>
> Kai
>
> Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <or...@gmail.com>:
>
> > Hi Kai.
> >    It is very interesting. Can you please explain your idea in more
> > detail?
> > What will the key be in the map phase?
> >
> > Suppose we have an event at 10:07. How would you emit it to multiple
> > buckets?
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <k...@123.org> wrote:
> >
> >> Quick idea:
> >>
> >> since each of your events will go into several buckets, you could use
> >> map() to emit each item multiple times for each bucket.
> >>
> >> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <or...@gmail.com>:
> >>
> >>> Hi ,
>>>   I have raw data with the following structure:
> >>>
> >>> event_id  |   time
> >>> ==============
> >>> event1     |  10:07
> >>> event2     |  10:10
> >>> event3     |  10:12
> >>>
> >>> event4     |   10:20
> >>> event5     |   10:23
> >>> event6     |   10:25
> >>
> >> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
> >> (10:10,event1) and so on.
> >>
> >> In reduce(), all your desired events would meet for the same minute.
> >>
> >> Kai
> >>
> >> --
> >> Kai Voigt
> >> k@123.org
> >>
> >>
> >>
> >>
> >>
>
> --
> Kai Voigt
> k@123.org
>
>
>
>
>

Re: aggregation by time window

Posted by Kai Voigt <k...@123.org>.
Hi again,

the idea is that you emit every event multiple times. So your map input record (event1, 10:07) will be emitted seven times during the map() call. Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be the seven outputs for processing a single event.

The output key will be the time stamp of the neighbourhood, or interval, in which each event should be joined with events that happened within +/- 3 minutes of it. So two events which happened within 7 minutes of each other will both be emitted with the same time stamp as a map() output, and thus meet in a reduce() call.

A reduce() call will look like this: reduce(10:03, list_of_events). And those events had time stamps between 10:00 and 10:06 in the original input.

Kai

Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <or...@gmail.com>:

> Hi Kai.
>    It is very interesting. Can you please explain your idea in more
> detail?
> What will the key be in the map phase?
> 
> Suppose we have an event at 10:07. How would you emit it to multiple
> buckets?
> 
> Thanks
> Oleg.
> 
> 
> On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <k...@123.org> wrote:
> 
>> Quick idea:
>> 
>> since each of your events will go into several buckets, you could use
>> map() to emit each item multiple times for each bucket.
>> 
>> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <or...@gmail.com>:
>> 
>>> Hi ,
>>>   I have raw data with the following structure:
>>> 
>>> event_id  |   time
>>> ==============
>>> event1     |  10:07
>>> event2     |  10:10
>>> event3     |  10:12
>>> 
>>> event4     |   10:20
>>> event5     |   10:23
>>> event6     |   10:25
>> 
>> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
>> (10:10,event1) and so on.
>> 
>> In reduce(), all your desired events would meet for the same minute.
>> 
>> Kai
>> 
>> --
>> Kai Voigt
>> k@123.org
>> 
>> 
>> 
>> 
>> 

-- 
Kai Voigt
k@123.org





Re: aggregation by time window

Posted by Oleg Ruchovets <or...@gmail.com>.
Hi Kai.
    It is very interesting. Can you please explain your idea in more
detail?
What will the key be in the map phase?

Suppose we have an event at 10:07. How would you emit it to multiple
buckets?

Thanks
Oleg.


On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <k...@123.org> wrote:

> Quick idea:
>
> since each of your events will go into several buckets, you could use
> map() to emit each item multiple times for each bucket.
>
> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <or...@gmail.com>:
>
> > Hi ,
> >    I have raw data with the following structure:
> >
> > event_id  |   time
> > ==============
> > event1     |  10:07
> > event2     |  10:10
> > event3     |  10:12
> >
> > event4     |   10:20
> > event5     |   10:23
> > event6     |   10:25
>
> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
> (10:10,event1) and so on.
>
> In reduce(), all your desired events would meet for the same minute.
>
> Kai
>
> --
> Kai Voigt
> k@123.org
>
>
>
>
>

Re: aggregation by time window

Posted by Kai Voigt <k...@123.org>.
Quick idea:

since each of your events will go into several buckets, you could use map() to emit each item multiple times, once for each bucket.

Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <or...@gmail.com>:

> Hi ,
>    I have raw data with the following structure:
> 
> event_id  |   time
> ==============
> event1     |  10:07
> event2     |  10:10
> event3     |  10:12
> 
> event4     |   10:20
> event5     |   10:23
> event6     |   10:25

map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., (10:10,event1) and so on.

In reduce(), all your desired events would meet for the same minute.

Kai

-- 
Kai Voigt
k@123.org