Posted to user@pig.apache.org by Dan Di Spaltro <da...@gmail.com> on 2010/05/07 07:13:33 UTC

Rollup time series data

Right now I have a Pig script to roll up time-series data.

The data currently arrives as a tab-separated value list with the following columns:
ts service-uuid service-name type value

So the first step is to take each timestamp and snap it to a period.
For 5-minute rollups I use something like this:

snapped = FOREACH X GENERATE SnapTs(300, ts) ....
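In context (assuming SnapTs(300, ts) just rounds the timestamp down to the
start of its 300-second period, i.e. ts - ts % 300), the step looks roughly
like this; the file name and field names are only illustrative:

raw = LOAD 'metrics.tsv' AS (ts:long, service_uuid:chararray,
    service_name:chararray, type:chararray, value:double);
-- snap each timestamp down to the start of its 5-minute period
snapped = FOREACH raw GENERATE SnapTs(300, ts) AS period_start,
    service_uuid, service_name, type, value;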

Then I group on the snapped timestamp and take an average and a count
over each group, which is great and easy.  The next bit is to show the
change from 0 -> 5 min: basically I want to take point A's average,
subtract it from point B's average, and divide by the time between the
two timestamps to get the rate of change between the points, but I am
not sure how to model that.  For instance, one idea I had was to create
another dataset like this:

previous = FOREACH snapped GENERATE $0 + 300, ....

GROUP previous BY (...), snapped BY (...)
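Spelled out with illustrative names (say avg is the per-period relation
that comes out of the group-and-average step, keyed on period_start and
service_uuid), that idea would be something like:

-- shift each period's average forward by one period length, so it
-- lines up with the *following* period under the same key
previous = FOREACH avg GENERATE period_start + 300 AS period_start,
    service_uuid, avg_value;
-- each group now holds one period's average plus the previous
-- period's average, ready for a subtraction in a nested FOREACH
pairs = COGROUP avg BY (period_start, service_uuid),
        previous BY (period_start, service_uuid);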

But that seems wasteful; I'm just having a hard time modeling it.
Any help would be appreciated.

Best,

-- 
Dan Di Spaltro

Re: Rollup time series data

Posted by Russell Jurney <ru...@gmail.com>.
The DateTime UDFs in PiggyBank may be helpful.  See
http://github.com/apache/pig/tree/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/datetime/truncate/

They feature date truncation, which can help group logs into whole time
units: day/hour/minute/second, etc.  It sounds like a proper rounding
function, parameterized by datetime unit (day, hour, minute, etc.) and a
number of splits, would be a good addition.  I'll try to get that in there
if people think it's a good idea.
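For example, a minimal sketch of how the truncate UDFs in that directory
are used, assuming ISO8601 timestamp strings (the relation and field names
are illustrative, and the piggybank jar must be REGISTERed first):

DEFINE ISOToMinute
    org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToMinute();

-- e.g. truncates '2010-05-07T07:13:33.000Z' to '2010-05-07T07:13:00.000Z'
by_minute = FOREACH logs GENERATE ISOToMinute(iso_ts) AS minute_bucket, value;
per_minute = GROUP by_minute BY minute_bucket;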

I've been meaning to add DateTime support to Pig, but don't have time in the
near future to do so :/

Russ


Re: Rollup time series data

Posted by Dan Di Spaltro <da...@gmail.com>.
Thanks for the response,  I appreciate the verbose example.

On Wed, May 12, 2010 at 5:43 PM, hc busy <hc...@gmail.com> wrote:
> Well, time series data usually comes in regular periods (the time between
> one timestamp and the next is always 300 seconds), so you can just divide
> the delta by the constant 300:
>
> analysis = foreach snapped generate ((double)delta / 300.0) as per_period;
>
> I guess what you are suggesting would work:
>
> snapped = foreach X generate start_of_period, total_so_far;
> snapped2 = foreach X generate start_of_period - 300 as
>     start_of_previous_period, total_so_far;
> diff_join = join snapped by start_of_period,
>     snapped2 by start_of_previous_period;
> diff = foreach diff_join generate start_of_previous_period as
>     start_of_period,
>     (snapped2::total_so_far - snapped::total_so_far) / 300.0 as
>     rate_of_change;
>
>
> This seems to fit the M-R paradigm: trade efficiency for scalability.
> You can use this to compute over arbitrarily large datasets just by buying
> twice the hardware you would otherwise need to compute it with an
> iterator... But remember, you won't need to write *any* unit tests,
> synchronization, file systems, or operating systems to make it happen,
> just the above five lines of code.

This is probably the best possible way of putting it.  Thanks for
the input =).




-- 
Dan Di Spaltro

Re: Rollup time series data

Posted by hc busy <hc...@gmail.com>.
Well, time series data usually comes in regular periods (the time between
one timestamp and the next is always 300 seconds), so you can just divide
the delta by the constant 300:

analysis = foreach snapped generate ((double)delta / 300.0) as per_period;

I guess what you are suggesting would work:

snapped = foreach X generate start_of_period, total_so_far;
snapped2 = foreach X generate start_of_period - 300 as
    start_of_previous_period, total_so_far;
diff_join = join snapped by start_of_period,
    snapped2 by start_of_previous_period;
diff = foreach diff_join generate start_of_previous_period as
    start_of_period,
    (snapped2::total_so_far - snapped::total_so_far) / 300.0 as
    rate_of_change;
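To make the join concrete, here is a tiny worked example (the numbers are
made up):

-- snapped  (start_of_period, total_so_far):          (1000, 40)  (1300, 55)
-- snapped2 (start_of_previous_period, total_so_far): (700, 40)   (1000, 55)
-- joining on start_of_period == start_of_previous_period pairs the
-- period starting at 1000 with the one starting at 1300, giving
--   rate_of_change = (55 - 40) / 300.0 = 0.05 per second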


This seems to fit the M-R paradigm: trade efficiency for scalability.
You can use this to compute over arbitrarily large datasets just by buying
twice the hardware you would otherwise need to compute it with an
iterator... But remember, you won't need to write *any* unit tests,
synchronization, file systems, or operating systems to make it happen,
just the above five lines of code.

Does this sound right to everyone else?


