Posted to common-user@hadoop.apache.org by Oleg Ruchovets <or...@gmail.com> on 2013/01/28 13:56:29 UTC
aggregation by time window
Hi,
I have data with the following row structure:
event_id | time
==============
event1 | 10:07
event2 | 10:10
event3 | 10:12
event4 | 10:20
event5 | 10:23
event6 | 10:25
The number of records is 50-100 million.
Question:
I need to find the events that occurred within a time window T.
For example, if T = 7 minutes:
event1, event2, event3 were detected within 7 minutes.
event4, event5, event6 were detected within 7 minutes.
How can I implement such an aggregation using map/reduce?
Thanks
Oleg.
Re: aggregation by time window
Posted by Oleg Ruchovets <or...@gmail.com>.
Hi Zhiwei,
No :-). Every 7 minutes is easy: integer-dividing the time in
milliseconds by (7 * 60000) will give you a bucket key.
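The fixed-window bucket key mentioned here can be sketched as follows. This is a minimal illustration in plain Python, not Hadoop code; the helper names `to_millis` and `bucket_key`, and the choice to measure time in milliseconds since midnight, are assumptions made for the example. Note the parentheses: the divisor is the whole window length, 7 * 60000 ms.

```python
WINDOW_MS = 7 * 60_000  # 7 minutes expressed in milliseconds


def to_millis(hhmm: str) -> int:
    """Convert an 'HH:MM' string to milliseconds since midnight (assumed format)."""
    hours, minutes = hhmm.split(":")
    return (int(hours) * 60 + int(minutes)) * 60_000


def bucket_key(time_ms: int, window_ms: int = WINDOW_MS) -> int:
    """Index of the fixed window containing time_ms, via integer division."""
    return time_ms // window_ms


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
keys = {event_id: bucket_key(to_millis(t)) for event_id, t in events}
print(keys)
```

With buckets aligned to midnight as in this sketch, event1 falls into a different bucket than event2 and event3, although it is only 3 minutes earlier than event2. That is why fixed buckets alone do not answer the question asked in this thread.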
I need to do the following:
Find the events that occurred within time T of a given event X.
In a very naive approach, I would take the first event and find the other
events that happened within 7 minutes of the first event's time. But I think
that will be very slow, and I am looking for a way to improve on it.
Thanks
Oleg.
On Mon, Jan 28, 2013 at 3:09 PM, Zhiwei Lin <zh...@gmail.com> wrote:
> do you mean every 7 mins?
> e.g, [10:07, 10:14),
> [10:14, 10:21) .....
Re: aggregation by time window
Posted by Zhiwei Lin <zh...@gmail.com>.
do you mean every 7 mins?
e.g, [10:07, 10:14),
[10:14, 10:21) .....
--
Best wishes.
Zhiwei
Re: aggregation by time window
Posted by Oleg Ruchovets <or...@gmail.com>.
Well, that is much clearer, but I still have a question :-)
Suppose we have 3 map input records:
event1 | 10:07
event2 | 10:10
event3 | 10:12
Output from map(event1 | 10:07) will be:
mapOutput(10:04:event1)
mapOutput(10:05:event1)
mapOutput(10:06:event1)
mapOutput(10:07:event1)
mapOutput(10:08:event1)
mapOutput(10:09:event1)
mapOutput(10:10:event1)
Output for map(event2 | 10:10) will be:
mapOutput(10:07:event2)
mapOutput(10:08:event2)
mapOutput(10:09:event2)
mapOutput(10:10:event2)
mapOutput(10:11:event2)
mapOutput(10:12:event2)
mapOutput(10:13:event2)
Output for map(event3 | 10:12) will be:
mapOutput(10:09:event3)
mapOutput(10:10:event3)
mapOutput(10:11:event3)
mapOutput(10:12:event3)
mapOutput(10:13:event3)
mapOutput(10:14:event3)
mapOutput(10:15:event3)
Is it correct?
If yes, in the reducer phase I will get these inputs:
reducer(10:04:event1)
reducer(10:05:event1)
reducer(10:06:event1)
reducer(10:07:event1, event2)
reducer(10:08:event1, event2)
reducer(10:09:event1, event2, event3)
reducer(10:10:event1, event2, event3)
reducer(10:11:event2, event3)
reducer(10:12:event2, event3)
reducer(10:13:event2, event3)
reducer(10:14:event3)
reducer(10:15:event3)
Iterating over each reducer input, how can I know at the end of the
aggregation which events occurred within 7 minutes?
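The map outputs and the grouping by key can be reproduced with a small simulation. This is a sketch in plain Python rather than a Hadoop job; `map_event` and `shuffle` are hypothetical names, and the +/- 3 minute radius is taken from the emission scheme discussed in this thread. In this simulation event2 is grouped under keys 10:07 through 10:13.

```python
from collections import defaultdict
from datetime import datetime, timedelta

RADIUS_MIN = 3  # emit each event under its own minute and 3 minutes either side


def map_event(event_id, hhmm):
    """Yield (key, event_id) for every minute within +/- RADIUS_MIN of hhmm."""
    t = datetime.strptime(hhmm, "%H:%M")
    for offset in range(-RADIUS_MIN, RADIUS_MIN + 1):
        # Simplification: keys carry no date, so windows crossing midnight wrap.
        yield (t + timedelta(minutes=offset)).strftime("%H:%M"), event_id


def shuffle(records):
    """Group map outputs by key, imitating what happens before reduce()."""
    grouped = defaultdict(list)
    for event_id, hhmm in records:
        for key, value in map_event(event_id, hhmm):
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


records = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
reducer_inputs = shuffle(records)
for key, events in reducer_inputs.items():
    print(key, events)
```

One possible reading of the result, which is an assumption and not something stated in the thread: each key's list is a candidate group, and the same group can show up under several neighbouring keys, so a later step would have to deduplicate the groups.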
Thanks
Oleg.
On Mon, Jan 28, 2013 at 3:48 PM, Kai Voigt <k...@123.org> wrote:
> Hi again,
>
> the idea is that you emit every event multiple times. So your map input
> record (event1, 10:07) will be emitted seven times during the map() call.
> Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be
> the seven outputs for processing a single event.
>
> The output key will be the time stamps in which neighbourhood or interval
> each event should be joined with events that happened +/- 3 minutes near
> it. So events which happened within a 7 minutes distance will both be
> emitted with the same time stamp as the map() output, and thus meet in a
> reduce() call.
>
> A reduce() call will look like this: reduce(10:03, list_of_events). And
> those events had time stamps between 10:00 and 10:06 in the original input.
>
> Kai
Re: aggregation by time window
Posted by Kai Voigt <k...@123.org>.
Hi again,
the idea is that you emit every event multiple times. So your map input record (event1, 10:07) will be emitted seven times during the map() call. Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be the seven outputs for processing a single event.
The output key is a time stamp: each event is emitted under every minute within +/- 3 minutes of its own time, so that it can be joined with the events that happened near it. Events that fall within the same 7-minute window will therefore both be emitted with the same time stamp as a map() output key, and thus meet in a reduce() call.
A reduce() call will look like this: reduce(10:03, list_of_events). And those events had time stamps between 10:00 and 10:06 in the original input.
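To check which pairs of events actually meet in a reduce() call under this scheme, the sets of keys emitted for two events can be intersected. The sketch below is plain Python with hypothetical helper names (`emitted_keys`, `shared_keys`, `hhmm`); working in whole minutes since midnight is an assumption made for the example.

```python
def emitted_keys(minute: int, radius: int = 3) -> set:
    """Keys (minutes since midnight) that a mapper emits for one event."""
    return set(range(minute - radius, minute + radius + 1))


def shared_keys(minute_a: int, minute_b: int, radius: int = 3) -> set:
    """Keys under which both events appear, i.e. reduce() calls where they meet."""
    return emitted_keys(minute_a, radius) & emitted_keys(minute_b, radius)


def hhmm(minute: int) -> str:
    """Format minutes since midnight as 'HH:MM'."""
    return f"{minute // 60:02d}:{minute % 60:02d}"


# Events at 10:00 and 10:06 share exactly one key.
meet = sorted(hhmm(k) for k in shared_keys(600, 606))
# Events at 10:00 and 10:07 share no key, so they never meet.
no_meet = shared_keys(600, 607)
print(meet, no_meet)
```

So with a radius of 3 minutes, two events meet in some reduce() call when their time stamps differ by at most 6 minutes, which matches the 10:00 to 10:06 range given for reduce(10:03, list_of_events).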
Kai
On 28.01.2013 at 14:43, Oleg Ruchovets <or...@gmail.com> wrote:
> Hi Kai.
> It is very interesting. Can you please explain in more details your
> Idea?
> What will be a key in a map phase?
>
> Suppose we have event at 10:07. How would you emit this to the multiple
> buckets?
>
> Thanks
> Oleg.
--
Kai Voigt
k@123.org
Re: aggregation by time window
Posted by Oleg Ruchovets <or...@gmail.com>.
Hi Kai.
This is very interesting. Can you please explain your idea in more
detail?
What will the key be in the map phase?
Suppose we have an event at 10:07. How would you emit it to multiple
buckets?
Thanks
Oleg.
On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <k...@123.org> wrote:
> Quick idea:
>
> since each of your events will go into several buckets, you could use
> map() to emit each item multiple times for each bucket.
>
> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
> (10:10,event1) and so on.
>
> In reduce(), all your desired events would meet for the same minute.
>
> Kai
Re: aggregation by time window
Posted by Kai Voigt <k...@123.org>.
Quick idea:
Since each of your events will go into several buckets, you could use map() to emit each item multiple times, once for each bucket.
On 28.01.2013 at 13:56, Oleg Ruchovets <or...@gmail.com> wrote:
> Hi ,
> I have such row data structure:
>
> event_id | time
> ==============
> event1 | 10:07
> event2 | 10:10
> event3 | 10:12
>
> event4 | 10:20
> event5 | 10:23
> event6 | 10:25
map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., (10:10,event1) and so on.
In reduce(), all your desired events would meet for the same minute.
Kai
--
Kai Voigt
k@123.org