Posted to dev@metron.apache.org by Casey Stella <ce...@gmail.com> on 2017/02/02 15:25:17 UTC

Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET

Ok, so to turn this discussion into reality, I'm going to create
3 JIRAs:

   - METRON-684: Decouple Timestamp calculation from PROFILE_GET
      - This enables pluggable timestamp computation from other functions
   - METRON-689: Create Cron-syntax based timestamp lookup for profiler to
   enable sparse windows
   - METRON-690: Create a DSL-based timestamp lookup for profiler to enable
   sparse windows

I'm starting 684 now and will probably pick 690 afterwards.  If someone
wants to do 689, I think it'd be great to have different approaches.
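
As a rough illustration of what the sparse-window timestamp lookup in
689/690 would have to compute, here is a minimal Python sketch (not Metron
or Stellar code).  The function name sparse_bins, its parameters, and the
30-day approximation of "1 month" are assumptions made for this example.
It reproduces the "every Tuesday, starting one hour ago, for the last
month" case from the original proposal quoted below.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday(): Monday is 0, so Tuesday is 1


def sparse_bins(now, from_ago, to_ago, bin_size, weekdays):
    """Hypothetical helper: return (start, end) bins, newest first.

    Walks back one day at a time from (now - from_ago) to (now - to_ago)
    and keeps only the days whose weekday is in `weekdays`.
    """
    bins = []
    start = now - from_ago      # start of the most recent candidate bin
    earliest = now - to_ago     # do not look back further than this
    while start >= earliest:
        if start.weekday() in weekdays:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)
    return bins


# Run "at 3PM on Monday January 23rd, 2017", as in the proposal.
now = datetime(2017, 1, 23, 15, 0)
bins = sparse_bins(
    now,
    from_ago=timedelta(hours=1),
    to_ago=timedelta(days=30),   # assumption: "1 month" taken as 30 days
    bin_size=timedelta(hours=1),
    weekdays={TUESDAY},
)
for start, end in bins:
    print(start.isoformat(), "-", end.isoformat())
```

Run as of 3PM on Monday January 23rd, 2017, this yields the 2PM - 3PM bins
on January 17th, 10th, 3rd and December 27th, matching the example in the
proposal.  Keeping this computation out of PROFILE_GET is what would let a
cron-based and a DSL-based lookup coexist.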

Thanks for the feedback.

On Tue, Jan 31, 2017 at 2:25 PM, Casey Stella <ce...@gmail.com> wrote:

> +1 to that.  Hear, hear! :)
>
> On Tue, Jan 31, 2017 at 2:21 PM, Nick Allen <ni...@nickallen.org> wrote:
>
>> Yes, agreed.  However we specify the schedule, it would be decoupled from
>> the Profiler client functions.
>>
>> timestamps := CRON("* * * ? * foo bar")
>> profiles := PROFILE_GET("profile1", "entity1", timestamps)
>>
>>
>> Or
>>
>> timestamps := STELLAR_DSL("on every other Tuesday")
>> profiles := PROFILE_GET("profile1", "entity1", timestamps)
>>
>>
>>
>>
>> On Tue, Jan 31, 2017 at 2:01 PM, Casey Stella <ce...@gmail.com> wrote:
>>
>> > One more point: one of the reasons for decoupling the PROFILE_GET from
>> > PROFILE_LOOKUP is that we could have alternative implementations of
>> > PROFILE_LOOKUP.  We could have a PROFILE_LOOKUP_CRON as well.
>> >
>> > On Tue, Jan 31, 2017 at 1:43 PM, Casey Stella <ce...@gmail.com>
>> wrote:
>> >
>> > > Regarding the "?" syntax:
>> > > Wouldn't that be forking cron syntax so now we have a metron cron?  If
>> > > we're constructing our own syntax, then why not do it so that it reads
>> > like
>> > > natural language?
>> > >
>> > > Regarding the holiday problem:
>> > > Agreed, it's a smaller problem than constructing a DSL, but that's not
>> > > really the point, I think. The concern is that it would be unable to
>> be
>> > > expressed using cron syntax in a natural way without modifying cron
>> > syntax,
>> > > which would be constructing a new DSL.  If quartz has a clever way of
>> > doing
>> > > that, then I'd like to see it.  From a quick search, I haven't seen a
>> > > scheduling example with a compact syntax that shows skipping holidays
>> > with
>> > > cron syntax.
>> > >
>> > >
>> > > On Tue, Jan 31, 2017 at 1:29 PM, Nick Allen <ni...@nickallen.org>
>> wrote:
>> > >
>> > >> >
>> > >> >    - Cron syntax allows you to construct only absolute lookbacks
>> (i.e.
>> > >> >    "every tuesday at 3PM" not "every tuesday at the current hour")
>> > >>
>> > >>
>> > >> I think Cron would work for this.  I am no expert on cron
>> expressions,
>> > but
>> > >> I think the following examples would work.
>> > >>
>> > >>    - If you want "every Tuesday at 3 PM"
>> > >>       - 0 0 15 ? * TUE *
>> > >>    - If you want "every Tuesday at current hour" then use something
>> like
>> > >>    the "?" placeholder maybe.
>> > >>       - 0 0 ? ? * TUE *
>> > >>
>> > >> - Cron syntax allows you to specify a point in time, not a
>> duration.  We
>> > >> >    could, of course, specify a duration as another argument
>> > >>
>> > >>
>> > >> Yes, a separate argument would be necessary.  We would have to allow
>> the
>> > >> user to specify either a "start from date/time" or the "number of
>> > >> intervals
>> > >> to look back".
>> > >>
>> > >> Cron syntax does not allow you to skip things like holidays, etc.
>> > >>
>> > >>
>> > >> I agree, out-of-the-box Cron does not solve holiday calendars.  But
>> this
>> > >> would be a smaller problem to solve than creating our own DSL.
>> > >>
>> > >> There is a tradition of creating shortcuts that look something like
>> > @Daily
>> > >> or @Weekdays or @Tuesdays that we could also use to make things
>> easier
>> > for
>> > >> users.
>> > >>
>> > >> I have used Quartz with cron expressions in the past and there was
>> some
>> > >> way
>> > >> to handle holidays with that.  I think you could create a custom
>> > calendar
>> > >> for the holidays and call it something; aka @USHolidays.  And then
>> you
>> > >> would say "every Tuesday" except @USHolidays or something like that.
>> > I'd
>> > >> have to look into this some more.
>> > >>
>> > >> And there are also nice online Cron expression "translators" that we
>> > could
>> > >> mimic in a Metron user interface.  For example, https://crontab.guru
>> .
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <ce...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > I actually did consider cron initially but dismissed it for the
>> > >> following
>> > >> > reasons:
>> > >> >
>> > >> >    - Cron syntax allows you to construct only absolute lookbacks
>> (i.e.
>> > >> >    "every tuesday at 3PM" not "every tuesday at the current hour")
>> > >> >    - Cron syntax allows you to specify a point in time, not a
>> > >> duration.  We
>> > >> >    could, of course, specify a duration as another argument
>> > >> >    - Cron syntax does not allow you to skip things like holidays,
>> etc.
>> > >> >
>> > >> > You could use Cron syntax as part of a broader API to specify the
>> days
>> > >> to
>> > >> > look back and have other arguments handle the aspects that cron
>> > doesn't
>> > >> > support out of the box.  I share your concern at making another
>> DSL,
>> > but
>> > >> > cron seemed to not be a complete solution and its syntax, despite
>> > being
>> > >> > well known by admins, may not be well known to analysts.  Also, and
>> > >> this is
>> > >> > just a personal bias, I find it inscrutable without a fair amount
>> of
>> > >> > wikipedia and man page reading.
>> > >> >
>> > >> > On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <ni...@nickallen.org>
>> > >> wrote:
>> > >> >
>> > >> > > I do prefer the flexibility of the DSL, but would prefer not to
>> > create
>> > >> > yet
>> > >> > > another DSL for our users to learn.  Couldn't we somehow use cron
>> > >> > > expressions for this functionality?
>> > >> > >
>> > >> > > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <
>> cestella@gmail.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > Hi All,
>> > >> > > >
>> > >> > > > I'm planning to expand the capabilities of PROFILE_GET and
>> wanted
>> > to
>> > >> > pass
>> > >> > > > an idea past the community.
>> > >> > > >
>> > >> > > > *Current State*
>> > >> > > >
>> > >> > > > Currently, the functionality of PROFILE_GET is fairly
>> > >> straightforward:
>> > >> > > >
>> > >> > > >    - profile - The name of the profile.
>> > >> > > >    - entity - The name of the entity.
>> > >> > > >    - durationAgo - How long ago should values be retrieved
>> from?
>> > >> > > >    - units - The units of 'durationAgo'.
>> > >> > > >    - groups_list - Optional, must correspond to the 'groupBy'
>> list
>> > >> used
>> > >> > > in
>> > >> > > >    profile creation - List (in square brackets) of groupBy
>> values
>> > >> used
>> > >> > to
>> > >> > > >    filter the profile. Default is the empty list, meaning
>> groupBy
>> > >> was
>> > >> > not
>> > >> > > > used
>> > >> > > >    when creating the profile.
>> > >> > > >    - config_overrides - Optional - Map (in curly braces) of
>> > >> name:value
>> > >> > > >    pairs, each overriding the global config parameter of the
>> same
>> > >> name.
>> > >> > > >    Default is the empty Map, meaning no overrides.
>> > >> > > >
>> > >> > > > This has the advantage of providing a relatively simple
>> mechanism
>> > to
>> > >> > > > support the dominant use-case, gathering the profiles for a
>> > trailing
>> > >> > > > window.  There are, however, a couple of issues:
>> > >> > > >
>> > >> > > >    - We may need more complex semantics for specifying the
>> window
>> > >> > > >    (motivated below)
>> > >> > > >    - As such, this couples the gathering of the profiles with
>> the
>> > >> > > >    specification of the window.
>> > >> > > >
>> > >> > > > I propose to decouple these two concepts. I propose that we
>> > extract
>> > >> the
>> > >> > > > notion of the lookback into a separate, more featureful
>> function
>> > >> called
>> > >> > > > PROFILE_LOOKBACK() which could be composed with an adjusted
>> > >> > PROFILE_GET,
>> > >> > > > whose arguments look like:
>> > >> > > >
>> > >> > > >
>> > >> > > >    - profile - The name of the profile.
>> > >> > > >    - entity - The name of the entity.
>> > >> > > >    - timestamps - The list of timestamps to retrieve
>> > >> > > >    - groups_list - Optional, must correspond to the 'groupBy'
>> list
>> > >> used
>> > >> > > in
>> > >> > > >    profile creation - List (in square brackets) of groupBy
>> values
>> > >> used
>> > >> > to
>> > >> > > >    filter the profile. Default is the empty list, meaning
>> groupBy
>> > >> was
>> > >> > not
>> > >> > > > used
>> > >> > > >    when creating the profile.
>> > >> > > >    - config_overrides - Optional - Map (in curly braces) of
>> > >> name:value
>> > >> > > >    pairs, each overriding the global config parameter of the
>> same
>> > >> name.
>> > >> > > >    Default is the empty Map, meaning no overrides.
>> > >> > > >
>> > >> > > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK
>> passed
>> > to
>> > >> it
>> > >> > as
>> > >> > > > its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
>> > >> > > > PROFILE_LOOKBACK(...)) ).
>> > >> > > >
>> > >> > > > *Motivation for Change*
>> > >> > > >
>> > >> > > > The justification for this is that sometimes you want to
>> compare
>> > >> time
>> > >> > > bins
>> > >> > > > for a long duration back, but you don't want to skew the data
>> by
>> > >> > > including
>> > >> > > > periods that aren't distributionally similar (due to seasonal
>> > data,
>> > >> for
>> > >> > > > instance).  You might want to compare a value to a statistical
>> > >> baseline
>> > >> > > of
>> > >> > > > the median of the values for the same time window on the same
>> day
>> > >> for
>> > >> > the
>> > >> > > > last month (e.g. every tuesday at this time).
>> > >> > > >
>> > >> > > > Also, we might want a trailing window that does not start at
>> the
>> > >> > current
>> > >> > > > time (in wall-clock), but rather starts an hour back or from
>> the
>> > >> time
>> > >> > > that
>> > >> > > > the data was originally ingested.
>> > >> > > >
>> > >> > > >
>> > >> > > > *PROFILE_LOOKBACK*
>> > >> > > >
>> > >> > > > I propose that we support the following features:
>> > >> > > >
>> > >> > > >    - A starting point that is not current time
>> > >> > > >    - Sparse bins (i.e. the last hour for every tuesday for the
>> > last
>> > >> > > month)
>> > >> > > >    - The ability to skip events (e.g. weekends, holidays)
>> > >> > > >
>> > >> > > >
>> > >> > > > This would result in a new function with the following
>> arguments:
>> > >> > > >
>> > >> > > >    - from - The lookback starting point (default to now)
>> > >> > > >    - fromUnits - The units for the lookback starting point
>> > >> > > >    - to - The ending point for the lookback window (default to
>> > >> > > >    from + binSize)
>> > >> > > >    - toUnits - The units for the lookback ending point
>> > >> > > >    - including - A list of conditions which we would include.
>> > >> > > >       - weekend
>> > >> > > >       - holiday
>> > >> > > >       - sunday through saturday
>> > >> > > >    - excluding - A list of conditions which we would skip.
>> > >> > > >       - weekend
>> > >> > > >       - holiday
>> > >> > > >       - sunday through saturday
>> > >> > > >    - binSize - The size of the lookback bin
>> > >> > > >    - binUnits - The units of the lookback bin
>> > >> > > >
>> > >> > > > Given the number of arguments and their complexity and the fact
>> > that
>> > >> > > many,
>> > >> > > > many are optional, I propose that either
>> > >> > > >
>> > >> > > >    - PROFILE_LOOKBACK take a Map so that we can get essentially
>> > >> named
>> > >> > > >    params in stellar.
>> > >> > > >    - PROFILE_LOOKBACK accept a string backed by a DSL to
>> express
>> > >> these
>> > >> > > >    criteria
>> > >> > > >
>> > >> > > >
>> > >> > > > Ok, so that's a lot to take in.  How about we look at some
>> > >> motivating
>> > >> > > > use-cases.
>> > >> > > >
>> > >> > > > *Base Case: A lookback of 1 hour ago*
>> > >> > > >
>> > >> > > > As a map, this would look like:
>> > >> > > >
>> > >> > > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>> > >> > > >
>> > >> > > > As a DSL this would look like:
>> > >> > > > PROFILE_LOOKBACK( '1 hour bins from now')
>> > >> > > >
>> > >> > > >
>> > >> > > > *The same time window every tuesday for the last month starting
>> > one
>> > >> > hour
>> > >> > > > ago*
>> > >> > > >
>> > >> > > > Just to make this as clear as possible, if this is run at 3PM
>> on
>> > >> Monday
>> > >> > > > January 23rd, 2017, it would include the following bins:
>> > >> > > >
>> > >> > > >    - January 17th, 2PM - 3PM
>> > >> > > >    - January 10th, 2PM - 3PM
>> > >> > > >    - January 3rd, 2PM - 3PM
>> > >> > > >    - December 27th, 2PM - 3PM
>> > >> > > >
>> > >> > > > As a map, this would look like:
>> > >> > > >
>> > >> > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' :
>> 1,
>> > >> > > 'toUnits'
>> > >> > > > : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
>> 'binUnits'
>> > :
>> > >> > > 'HOURS'
>> > >> > > > } )
>> > >> > > >
>> > >> > > > As a DSL this would look like:
>> > >> > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including
>> > >> > > tuesdays')
>> > >> > > >
>> > >> > > > *The same time window every sunday for the last month starting
>> one
>> > >> hour
>> > >> > > ago
>> > >> > > > skipping holidays*
>> > >> > > >
>> > >> > > > Just to make this as clear as possible, if this is run at 3PM
>> on
>> > >> Monday
>> > >> > > > January 22nd, 2017, it would include the following bins:
>> > >> > > >
>> > >> > > >    - January 16th, 2PM - 3PM
>> > >> > > >    - January 9th, 2PM - 3PM
>> > >> > > >    - January 2nd, 2PM - 3PM
>> > >> > > >    - NOT December 25th
>> > >> > > >
>> > >> > > > As a map, this would look like:
>> > >> > > >
>> > >> > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' :
>> 1,
>> > >> > > 'toUnits'
>> > >> > > > : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [
>> 'holidays'
>> > ],
>> > >> > > > 'binSize' : 1, 'binUnits' : 'HOURS' } )
>> > >> > > >
>> > >> > > > As a DSL this would look like:
>> > >> > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including
>> > >> > tuesdays
>> > >> > > > excluding holidays')
>> > >> > > >
>> > >> > > > *DSL vs API*
>> > >> > > >
>> > >> > > > So, here's my personal rundown of the two approaches:
>> > >> > > >
>> > >> > > > DSL:
>> > >> > > >
>> > >> > > >    - PRO
>> > >> > > >       - Clear.  As you can see, it reads like a sentence
>> > >> > > >       - Concise
>> > >> > > >    - CON
>> > >> > > >       - More complex to implement
>> > >> > > >       - Another DSL to learn
>> > >> > > >
>> > >> > > > API:
>> > >> > > >
>> > >> > > >    - PRO
>> > >> > > >       - Simpler to implement (though marginally so, IMO)
>> > >> > > >    - CON
>> > >> > > >       - A bit more complex to understand (also, IMO)
>> > >> > > >
>> > >> > > > I'd like to solicit feedback from the community at this point:
>> > >> > > >
>> > >> > > >    - What do you think of this change?
>> > >> > > >    - Would you prefer the DSL, API or other approach?
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > >
>> > >> > > > Casey
>> > >> > > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > --
>> > >> > > Nick Allen <ni...@nickallen.org>
>> > >> > >
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Nick Allen <ni...@nickallen.org>
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Nick Allen <ni...@nickallen.org>
>>
>
>