You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@metron.apache.org by Matt Foley <ma...@apache.org> on 2017/01/25 07:00:08 UTC

[DISCUSS] How to do Sliding Windows in Profiler

Hi all,

Casey and I had an interesting chat yesterday, during which we agreed that the example code for Outlier Analysis in https://github.com/apache/incubator-metron/blob/master/metron-analytics/metron-statistics/README.md and the revised example code in https://issues.apache.org/jira/browse/METRON-668 (as of 23 January) both do not correctly implement the desired Sliding Window model.  This email gives the argument for why, and proposes a couple ways to do it right.  Your input and preferences are requested.

 

First a couple statements about the STATS object that underlies most interesting Profile work:

·         The STATS object is a t-digest.  It summarizes a set of data points, such as those received during a sampling period, in a way that is nominally O(1) regardless of the input number of data points, and preserves the info about the “shape” of the statistical distribution of those points.  Not only info about averages and standard deviations, but also about medians and percentiles (which, btw, is a very different kind of information), is preserved and can be calculated correctly to within a given error epsilon.  Since it is a summary, however, time information is lost.

·         STATS objects, these digests of sampling periods, are MERGEABLE, meaning if you have a digest from time(1) to time(2), and another digest from time(2) to time (3), you can merge them and get a digest that is statistically equivalent to a digest from time(1) to time(3) continuously.

·         They are sort of idempotent, in that if you take two identical digests and merge them, you get almost the same object.  However, the result object will be scaled as summarizing twice the number of input data points.

·         Which is why it DOESN’T work to try to merge overlapping sampling periods.  To give a crude example, if you have a digest from time(1) to time(3) and another digest from time(2) to time(4), and merge them, the samples from time(2) to time(3) will be over-represented by a factor of 2x, which should be expected to skew the distribution (unless the distribution really is constant for all sub-windows – which would mean we don’t need Sliding Windows because nothing changes).

 

The Outlier Analysis profiles linked above try to implement a sliding window, in which each profile period summarizes the Median Absolute Deviation distribution of the last five profile periods only.  An “Outlier MAD Score” can then be determined by comparing the deviation of a new data point to the last MAD distribution recorded in the Profile.  This allows for changes over time in the statistical distribution of inputs, but does not make the measure unduly sensitive to just the last minute or two.  This is a typical use case for Sliding Windows.

 

Both example codes trip on how to do the sliding window in the context of a Profile.  At sampling period boundaries, both either compose the “result” digest or initialize the “next” digest by reading the previous 5 result digests.  That is wrong, because it ignores the fact that those digests aren’t just for their time periods.  They too were composed with THEIR preceding 5 digests, each of which were composed with their preceding 5 digests, which in turn… etc.  The end result is sort of like the way Madieras or some brandies are aged via the Solera process with fractional blending.  You don’t get a true sliding window, which sharply drops the past, you get a continuous dilution of the past. In fact, it’s wrong to assume that the “tail” of the far past is more diluted than the near past! – It would be with some algorithms, but the alg used in these two examples causes the far past to become an exponentially MORE important fraction of the overall data than the near past – much worse than simply turning on digesting at time(0) and leaving it on, with no attempt at windowing.  (Simulate it in a spreadsheet and you’ll see.)

 

We need a Profiler structure that assists in creating Sliding Window profiles.  The problem is that Profiles let you save only one thing (quantity or object) per sampling period, and that’s typically a different “thing” (object type or scale) than you want to use to compose the result for each windowed span.  One way to do it correctly would be with two Profiles, like this:

 

(SOLUTION A)

{

  "profiles": [

    {

      "profile": "sketchy_mad",

      "foreach": "'global'",

      "init" : {

        "s": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": "s"

    },

    {

      "profile": "windowed_mad",

      "foreach": "'global'",

      "init" : { },

      "update": { },

      "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',

'global', 5, 'MINUTES'))"

    }

  ]

}

 

This is typical.  You have a fine-grain sampling period that you want to “tumble”, and a broader window that you want to “slide” or “roll” along the underlying smaller sampling periods.

The above pair of profiles is clear enough, but not real efficient.  Also, you can’t guarantee they’ll stay exactly synchronized.  It would be better to combine them somehow.

 

In these examples, we only need to look back 5 sampling periods, so the following would work (NOTE: requires Casey’s PR#668 that preserves state between sampling periods): 

 

(SOLUTION B)

{

  "profiles": [

    {

      "profile": "windowed_mad",

      "foreach": "'global'",

      "init" : {

        "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",

        "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",

        "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",

       "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",

        "s0": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s0": "OUTLIER_MAD_ADD(s0, value)"

        },

      "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”

    }

  ]

}

 

This works fine for a sliding window of size 5 sample periods.  But what if it was 15?  Or 60?

No doubt we could implement a ring buffer with a stellar list object, but it’s gonna get ugly(er) and slow.

 

A much more elegant solution would be to allow “result” to write two objects.  The result could be viewed either as two Profiles, or as a Profile with multiple addressable columns of hbase output.  For instance:

(SOLUTION C)

{

  "profiles": [

    {

      "profile": "[ 'sketchy_mad', 'windowed_mad' ]",

      "foreach": "'global'",

      "init" : {

        "s": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": {

        "sketchy_mad": "s",

        "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, PROFILE_GET('sketchy_mad',

'global', 4, 'MINUTES'))"

        }

    }

  ]

}

 

This is clean and tight, and I think will be easily generalized to any Sliding Window scenario.

BTW, given that each write has to work it’s way through more processing before getting to HBase, it’s best to say that all writes happen AFTER all the “result” processing, which is why I showed the MERGE above retrieving only 4 sampling periods prior to the current one, and then add the current one from local state.

 

What do you think?  Can anyone specify a better way to do Sliding Window profiles correctly with current functionality?

 

 

Casey noted that if this feature were available, he would like to consider separating the value distribution from the deviation distribution, in the OUTLIER_MAD state objects.  He also notes that the hypothetical OUTLIER_MAD_INIT() function needs to take a starting median, as argument.  So the final Sliding Window Profile for OUTLIER_MAD state tracking might look more like:

 

{

  "profiles": [

    {

      "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",

      "foreach": "'global'",

      "init" : {

        "median": "STATS_PERCENTILE(STATS_MERGE(PROFILE_GET('sketchy_values', 'global', 5, 'MINUTES')), 50)",

        "s": "OUTLIER_MAD_INIT(median)",

        "val_dist": "STATS_INIT()"

        },

      "update": {

        "val_dist": "STATS_ADD(val_dist, value)",

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": {

        "sketchy_values": "val_dist",

        "sketchy_mad": "s",

        "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, PROFILE_GET('sketchy_mad',

'global', 4, 'MINUTES'))"

        }

    },

  ]

}

 

Thanks,

--Matt

Re: [DISCUSS] How to do Sliding Windows in Profiler

Posted by Casey Stella <ce...@gmail.com>.

So, I did some digging into the implementation of Median Absolute Deviation
that we ship.  I think we're doing the right thing here for sliding
windows.  The key assumption that we were making as part of this discussion
is that were storing 2 distributions in the state:

   - The distribution of the values
   - The distribution of the deviation from the median

In fact, we are actually storing 4 distributions:

   - The distribution of the values for this tick
      - This is used ONLY when merging states
   - The distribution of the values for the whole sliding window
      - This is used ONLY for scoring
   - The distribution of the deviation from the median of the sliding
   window for this tick
      - This is used ONLY when merging states
      - The median used here is the median of the merged tick-granular
      distribution of values across the whole window
   - The distribution of the deviation from the median of the sliding
   window for the whole sliding window
      - This is used ONLY for scoring
      - The median used here is the median of the merged tick-granular
      distribution of values across the whole window


The tick-granular distributions are used for merging, so we do not get
overlapping distributions of values or deviations from medians.  The
window-granular distributions are used for the actual scoring, so you get
the full context of the window.

I would urge you guys to check my work and let me know if I interpreted it
wrong by looking at the State class here
<https://github.com/apache/incubator-metron/blob/master/metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/outlier/MedianAbsoluteDeviationFunctions.java#L33>
.

All that being said, I still think that providing the ability to structure
multiple profiles at once as Matt suggested would be extremely nice and I'm
in favor of it, on the whole.

Casey

On Wed, Jan 25, 2017 at 2:06 PM, Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> Hey guys,
>
> 1. I'm going to Google that brandy analogy later. It sounds tasty :)
> 2. The option for reading/writing multiple profiles makes sense to me and
> provides much greater flexibility and efficiency.
> 3. My gut reaction to reading through this is "wow." This seems really
> complicated. I'm all for the flexibility provided by these composable
> functions and think this would be much better for users with either some
> templates or other form of abstraction to simplify common patterns.
>
> Mike
>
>
> On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella <ce...@gmail.com> wrote:
>
> > So, I think the key takeaway, at least for me, and my key feedback is
> that:
> >
> >    - We should separate the wrong behavior of MAD with the ability to
> >    improve windowing.  We can fix one without the other and, in my
> opinion,
> >    the key fix for MAD is keeping the value distribution independent of
> the
> >    MAD state, as Matt noted at the end of this email.
> >    - Because the above necessitates tracking two things, it'd be nice to
> >    adopt Matt's suggestion to be able to sort of merge profiles track two
> >    results.
> >
> > One more thing, in general, if you ever want to merge outputs on read,
> then
> > you should never write out the merged data (unless the data you're
> writing
> > has a way to untangle the overlap).  They are two separate semantics.
> >
> > On merge on read, you choose the window at read time.  An example of this
> > for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function
> which
> > takes the value distribution would be:
> > {
> >   "profiles": [
> >     {
> >       "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
> >       "foreach": "'global'",
> >       "init" : {
> >         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> >         "val_dist": "STATS_INIT()"
> >         },
> >       "update": {
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >         },
> >       "result": {
> >         "sketchy_values": "val_dist",
> >         "sketchy_mad": "s",
> >         }
> >     }
> >   ]
> > }
> >
> > You would then read this in the enrichment topology or wherever
> > via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > 'global', 5, 'MINUTES')), value)
> >
> >
> > On merge on write, you choose the window size on write time and should
> read
> > only the latest profile entry on read, so that would look like:
> > {
> >   "profiles": [
> >     {
> >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> >       "foreach": "'global'",
> >       "init" : {
> >         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> >         "val_dist": "STATS_INIT()"
> >         },
> >       "update": {
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >         },
> >       "result": {
> >         "sketchy_values": "val_dist",
> >         "sketchy_mad": "s",
> >         "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
> >         }
> >     }
> >   ]
> > }
> >
> >
> > You would then score pulling only the most recent MAD state.  Merging
> > across them would double count entries, as Matt mentioned above
> > OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
> > 'MINUTES')), value)
> >
> > So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
> > multiple outputs.  Also, different but coupled problem, MAD needs fixing
> to
> > decouple tracking value distributions from the distribution of the
> > deviation from the median.
> >
> > Casey
> >
> > On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <ni...@nickallen.org> wrote:
> >
> > > That was a lot to digest so I apologize if I have missed some of your
> > > thought process.  I probably need to read through this a couple more
> > times.
> > >
> > >
> > > Can anyone specify a better way to do Sliding Window profiles correctly
> > > > with current functionality?
> > >
> > >
> > > Could we not change the underlying implementation of the profiler to
> > > support sliding windows?  Ideally the same profile definition could be
> > > applied to either a tumbling or sliding window, rather than requiring a
> > > change in the profile definition itself.
> > >
> > > The work that I had done for METRON-590, made it possible to configure
> > > either a tumbling or sliding window.  The downside of that work is that
> > it
> > > was an all-or-nothing change, all profiles (in the same topology) were
> > > either tumbling or sliding.  But perhaps we can enhance it from there.
> > >
> > >
> > > A much more elegant solution would be to allow “result” to write two
> > > > objects.
> > >
> > >
> > > Yes, I like this! I have had the same thought myself.  This would be a
> > > simple change and would provide the user with some flexibility.  Does
> > this
> > > change solve the entire problem though?
> > >
> > >
> > > Obviously, I do need to sit down and think on this a bit more.  I
> really
> > > wish we could handle these use cases with a profile definition that
> > remains
> > > simple and easy to grok.  Thanks for spending the time to think through
> > > this.
> > >
> > > On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <ma...@apache.org> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Casey and I had an interesting chat yesterday, during which we agreed
> > > that
> > > > the example code for Outlier Analysis in https://github.com/apache/
> > > > incubator-metron/blob/master/metron-analytics/metron-
> > > statistics/README.md
> > > > and the revised example code in https://issues.apache.org/
> > > > jira/browse/METRON-668 (as of 23 January) both do not correctly
> > implement
> > > > the desired Sliding Window model.  This email gives the argument for
> > why,
> > > > and proposes a couple ways to do it right.  Your input and
> preferences
> > > are
> > > > requested.
> > > >
> > > >
> > > >
> > > > First a couple statements about the STATS object that underlies most
> > > > interesting Profile work:
> > > >
> > > > ·         The STATS object is a t-digest.  It summarizes a set of
> data
> > > > points, such as those received during a sampling period, in a way
> that
> > is
> > > > nominally O(1) regardless of the input number of data points, and
> > > preserves
> > > > the info about the “shape” of the statistical distribution of those
> > > > points.  Not only info about averages and standard deviations, but
> also
> > > > about medians and percentiles (which, btw, is a very different kind
> of
> > > > information), is preserved and can be calculated correctly to within
> a
> > > > given error epsilon.  Since it is a summary, however, time
> information
> > is
> > > > lost.
> > > >
> > > > ·         STATS objects, these digests of sampling periods, are
> > > MERGEABLE,
> > > > meaning if you have a digest from time(1) to time(2), and another
> > digest
> > > > from time(2) to time (3), you can merge them and get a digest that is
> > > > statistically equivalent to a digest from time(1) to time(3)
> > > continuously.
> > > >
> > > > ·         They are sort of idempotent, in that if you take two
> > identical
> > > > digests and merge them, you get almost the same object.  However, the
> > > > result object will be scaled as summarizing twice the number of input
> > > data
> > > > points.
> > > >
> > > > ·         Which is why it DOESN’T work to try to merge overlapping
> > > > sampling periods.  To give a crude example, if you have a digest from
> > > > time(1) to time(3) and another digest from time(2) to time(4), and
> > merge
> > > > them, the samples from time(2) to time(3) will be over-represented
> by a
> > > > factor of 2x, which should be expected to skew the distribution
> (unless
> > > the
> > > > distribution really is constant for all sub-windows – which would
> mean
> > we
> > > > don’t need Sliding Windows because nothing changes).
> > > >
> > > >
> > > >
> > > > The Outlier Analysis profiles linked above try to implement a sliding
> > > > window, in which each profile period summarizes the Median Absolute
> > > > Deviation distribution of the last five profile periods only.  An
> > > “Outlier
> > > > MAD Score” can then be determined by comparing the deviation of a new
> > > data
> > > > point to the last MAD distribution recorded in the Profile.  This
> > allows
> > > > for changes over time in the statistical distribution of inputs, but
> > does
> > > > not make the measure unduly sensitive to just the last minute or two.
> > > This
> > > > is a typical use case for Sliding Windows.
> > > >
> > > >
> > > >
> > > > Both example codes trip on how to do the sliding window in the
> context
> > of
> > > > a Profile.  At sampling period boundaries, both either compose the
> > > “result”
> > > > digest or initialize the “next” digest by reading the previous 5
> result
> > > > digests.  That is wrong, because it ignores the fact that those
> digests
> > > > aren’t just for their time periods.  They too were composed with
> THEIR
> > > > preceding 5 digests, each of which were composed with their
> preceding 5
> > > > digests, which in turn… etc.  The end result is sort of like the way
> > > > Madieras or some brandies are aged via the Solera process with
> > fractional
> > > > blending.  You don’t get a true sliding window, which sharply drops
> the
> > > > past, you get a continuous dilution of the past. In fact, it’s wrong
> to
> > > > assume that the “tail” of the far past is more diluted than the near
> > > past!
> > > > – It would be with some algorithms, but the alg used in these two
> > > examples
> > > > causes the far past to become an exponentially MORE important
> fraction
> > of
> > > > the overall data than the near past – much worse than simply turning
> on
> > > > digesting at time(0) and leaving it on, with no attempt at windowing.
> > > > (Simulate it in a spreadsheet and you’ll see.)
> > > >
> > > >
> > > >
> > > > We need a Profiler structure that assists in creating Sliding Window
> > > > profiles.  The problem is that Profiles let you save only one thing
> > > > (quantity or object) per sampling period, and that’s typically a
> > > different
> > > > “thing” (object type or scale) than you want to use to compose the
> > result
> > > > for each windowed span.  One way to do it correctly would be with two
> > > > Profiles, like this:
> > > >
> > > >
> > > >
> > > > (SOLUTION A)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "sketchy_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": "s"
> > > >
> > > >     },
> > > >
> > > >     {
> > > >
> > > >       "profile": "windowed_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : { },
> > > >
> > > >       "update": { },
> > > >
> > > >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 5, 'MINUTES'))"
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This is typical.  You have a fine-grain sampling period that you want
> > to
> > > > “tumble”, and a broader window that you want to “slide” or “roll”
> along
> > > the
> > > > underlying smaller sampling periods.
> > > >
> > > > The above pair of profiles is clear enough, but not real efficient.
> > > Also,
> > > > you can’t guarantee they’ll stay exactly synchronized.  It would be
> > > better
> > > > to combine them somehow.
> > > >
> > > >
> > > >
> > > > In these examples, we only need to look back 5 sampling periods, so
> the
> > > > following would work (NOTE: requires Casey’s PR#668 that preserves
> > state
> > > > between sampling periods):
> > > >
> > > >
> > > >
> > > > (SOLUTION B)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "windowed_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> > > >
> > > >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s0": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This works fine for a sliding window of size 5 sample periods.  But
> > what
> > > > if it was 15?  Or 60?
> > > >
> > > > No doubt we could implement a ring buffer with a stellar list object,
> > but
> > > > it’s gonna get ugly(er) and slow.
> > > >
> > > >
> > > >
> > > > A much more elegant solution would be to allow “result” to write two
> > > > objects.  The result could be viewed either as two Profiles, or as a
> > > > Profile with multiple addressable columns of hbase output.  For
> > instance:
> > > >
> > > > (SOLUTION C)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": {
> > > >
> > > >         "sketchy_mad": "s",
> > > >
> > > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > > PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 4, 'MINUTES'))"
> > > >
> > > >         }
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This is clean and tight, and I think will be easily generalized to
> any
> > > > Sliding Window scenario.
> > > >
> > > > BTW, given that each write has to work it’s way through more
> processing
> > > > before getting to HBase, it’s best to say that all writes happen
> AFTER
> > > all
> > > > the “result” processing, which is why I showed the MERGE above
> > retrieving
> > > > only 4 sampling periods prior to the current one, and then add the
> > > current
> > > > one from local state.
> > > >
> > > >
> > > >
> > > > What do you think?  Can anyone specify a better way to do Sliding
> > Window
> > > > profiles correctly with current functionality?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Casey noted that if this feature were available, he would like to
> > > consider
> > > > separating the value distribution from the deviation distribution, in
> > the
> > > > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > > > OUTLIER_MAD_INIT() function needs to take a starting median, as
> > argument.
> > > > So the final Sliding Window Profile for OUTLIER_MAD state tracking
> > might
> > > > look more like:
> > > >
> > > >
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad'
> ]",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "median": "STATS_PERCENTILE(STATS_MERGE(
> > > PROFILE_GET('sketchy_values',
> > > > 'global', 5, 'MINUTES')), 50)",
> > > >
> > > >         "s": "OUTLIER_MAD_INIT(median)",
> > > >
> > > >         "val_dist": "STATS_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "val_dist": "STATS_ADD(val_dist, value)",
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": {
> > > >
> > > >         "sketchy_values": "val_dist",
> > > >
> > > >         "sketchy_mad": "s",
> > > >
> > > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > > PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 4, 'MINUTES'))"
> > > >
> > > >         }
> > > >
> > > >     },
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > --Matt
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Nick Allen <ni...@nickallen.org>
> > >
> >
>

Re: [DISCUSS] How to do Sliding Windows in Profiler

Posted by Michael Miklavcic <mi...@gmail.com>.

Hey guys,

1. I'm going to Google that brandy analogy later. It sounds tasty :)
2. The option for reading/writing multiple profiles makes sense to me and
provides much greater flexibility and efficiency.
3. My gut reaction to reading through this is "wow." This seems really
complicated. I'm all for the flexibility provided by these composable
functions and think this would be much better for users with either some
templates or other form of abstraction to simplify common patterns.

Mike


On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella <ce...@gmail.com> wrote:

> So, I think the key takeaway, at least for me, and my key feedback is that:
>
>    - We should separate the wrong behavior of MAD with the ability to
>    improve windowing.  We can fix one without the other and, in my opinion,
>    the key fix for MAD is keeping the value distribution independent of the
>    MAD state, as Matt noted at the end of this email.
>    - Because the above necessitates tracking two things, it'd be nice to
>    adopt Matt's suggestion to be able to sort of merge profiles track two
>    results.
>
> One more thing, in general, if you ever want to merge outputs on read, then
> you should never write out the merged data (unless the data you're writing
> has a way to untangle the overlap).  They are two separate semantics.
>
> On merge on read, you choose the window at read time.  An example of this
> for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function which
> takes the value distribution would be:
> {
>   "profiles": [
>     {
>       "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
>       "foreach": "'global'",
>       "init" : {
>         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
>         "val_dist": "STATS_INIT()"
>         },
>       "update": {
>         "val_dist": "STATS_ADD(val_dist, value)",
>         "s": "OUTLIER_MAD_ADD(s, value)"
>         },
>       "result": {
>         "sketchy_values": "val_dist",
>         "sketchy_mad": "s",
>         }
>     }
>   ]
> }
>
> You would then read this in the enrichment topology or wherever
> via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> 'global', 5, 'MINUTES')), value)
>
>
> On merge on write, you choose the window size on write time and should read
> only the latest profile entry on read, so that would look like:
> {
>   "profiles": [
>     {
>       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
>       "foreach": "'global'",
>       "init" : {
>         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
>         "val_dist": "STATS_INIT()"
>         },
>       "update": {
>         "val_dist": "STATS_ADD(val_dist, value)",
>         "s": "OUTLIER_MAD_ADD(s, value)"
>         },
>       "result": {
>         "sketchy_values": "val_dist",
>         "sketchy_mad": "s",
>         "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
>         }
>     }
>   ]
> }
>
>
> You would then score pulling only the most recent MAD state.  Merging
> across them would double count entries, as Matt mentioned above
> OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
> 'MINUTES')), value)
>
> So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
> multiple outputs.  Also, different but coupled problem, MAD needs fixing to
> decouple tracking value distributions from the distribution of the
> deviation from the median.
>
> Casey
>
> On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <ni...@nickallen.org> wrote:
>
> > That was a lot to digest so I apologize if I have missed some of your
> > thought process.  I probably need to read through this a couple more
> times.
> >
> >
> > Can anyone specify a better way to do Sliding Window profiles correctly
> > > with current functionality?
> >
> >
> > Could we not change the underlying implementation of the profiler to
> > support sliding windows?  Ideally the same profile definition could be
> > applied to either a tumbling or sliding window, rather than requiring a
> > change in the profile definition itself.
> >
> > The work that I had done for METRON-590, made it possible to configure
> > either a tumbling or sliding window.  The downside of that work is that
> it
> > was an all-or-nothing change, all profiles (in the same topology) were
> > either tumbling or sliding.  But perhaps we can enhance it from there.
> >
> >
> > A much more elegant solution would be to allow “result” to write two
> > > objects.
> >
> >
> > Yes, I like this! I have had the same thought myself.  This would be a
> > simple change and would provide the user with some flexibility.  Does
> this
> > change solve the entire problem though?
> >
> >
> > Obviously, I do need to sit down and think on this a bit more.  I really
> > wish we could handle these use cases with a profile definition that
> remains
> > simple and easy to grok.  Thanks for spending the time to think through
> > this.
> >
> > On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <ma...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > Casey and I had an interesting chat yesterday, during which we agreed
> > that
> > > the example code for Outlier Analysis in https://github.com/apache/
> > > incubator-metron/blob/master/metron-analytics/metron-
> > statistics/README.md
> > > and the revised example code in https://issues.apache.org/
> > > jira/browse/METRON-668 (as of 23 January) both do not correctly
> implement
> > > the desired Sliding Window model.  This email gives the argument for
> why,
> > > and proposes a couple ways to do it right.  Your input and preferences
> > are
> > > requested.
> > >
> > >
> > >
> > > First a couple statements about the STATS object that underlies most
> > > interesting Profile work:
> > >
> > > ·         The STATS object is a t-digest.  It summarizes a set of data
> > > points, such as those received during a sampling period, in a way that
> is
> > > nominally O(1) regardless of the input number of data points, and
> > preserves
> > > the info about the “shape” of the statistical distribution of those
> > > points.  Not only info about averages and standard deviations, but also
> > > about medians and percentiles (which, btw, is a very different kind of
> > > information), is preserved and can be calculated correctly to within a
> > > given error epsilon.  Since it is a summary, however, time information
> is
> > > lost.
> > >
> > > ·         STATS objects, these digests of sampling periods, are
> > MERGEABLE,
> > > meaning if you have a digest from time(1) to time(2), and another
> digest
> > > from time(2) to time (3), you can merge them and get a digest that is
> > > statistically equivalent to a digest from time(1) to time(3)
> > continuously.
> > >
> > > ·         They are sort of idempotent, in that if you take two
> identical
> > > digests and merge them, you get almost the same object.  However, the
> > > result object will be scaled as summarizing twice the number of input
> > data
> > > points.
> > >
> > > ·         Which is why it DOESN’T work to try to merge overlapping
> > > sampling periods.  To give a crude example, if you have a digest from
> > > time(1) to time(3) and another digest from time(2) to time(4), and
> merge
> > > them, the samples from time(2) to time(3) will be over-represented by a
> > > factor of 2x, which should be expected to skew the distribution (unless
> > the
> > > distribution really is constant for all sub-windows – which would mean
> we
> > > don’t need Sliding Windows because nothing changes).
> > >
> > >
> > >
> > > The Outlier Analysis profiles linked above try to implement a sliding
> > > window, in which each profile period summarizes the Median Absolute
> > > Deviation distribution of the last five profile periods only.  An
> > “Outlier
> > > MAD Score” can then be determined by comparing the deviation of a new
> > data
> > > point to the last MAD distribution recorded in the Profile.  This
> allows
> > > for changes over time in the statistical distribution of inputs, but
> does
> > > not make the measure unduly sensitive to just the last minute or two.
> > This
> > > is a typical use case for Sliding Windows.
> > >
> > >
> > >
> > > Both example codes trip on how to do the sliding window in the context
> of
> > > a Profile.  At sampling period boundaries, both either compose the
> > “result”
> > > digest or initialize the “next” digest by reading the previous 5 result
> > > digests.  That is wrong, because it ignores the fact that those digests
> > > aren’t just for their time periods.  They too were composed with THEIR
> > > preceding 5 digests, each of which were composed with their preceding 5
> > > digests, which in turn… etc.  The end result is sort of like the way
> > > Madieras or some brandies are aged via the Solera process with
> fractional
> > > blending.  You don’t get a true sliding window, which sharply drops the
> > > past, you get a continuous dilution of the past. In fact, it’s wrong to
> > > assume that the “tail” of the far past is more diluted than the near
> > past!
> > > – It would be with some algorithms, but the alg used in these two
> > examples
> > > causes the far past to become an exponentially MORE important fraction
> of
> > > the overall data than the near past – much worse than simply turning on
> > > digesting at time(0) and leaving it on, with no attempt at windowing.
> > > (Simulate it in a spreadsheet and you’ll see.)
> > >
> > >
> > >
> > > We need a Profiler structure that assists in creating Sliding Window
> > > profiles.  The problem is that Profiles let you save only one thing
> > > (quantity or object) per sampling period, and that’s typically a
> > different
> > > “thing” (object type or scale) than you want to use to compose the
> result
> > > for each windowed span.  One way to do it correctly would be with two
> > > Profiles, like this:
> > >
> > >
> > >
> > > (SOLUTION A)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "sketchy_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": "s"
> > >
> > >     },
> > >
> > >     {
> > >
> > >       "profile": "windowed_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : { },
> > >
> > >       "update": { },
> > >
> > >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 5, 'MINUTES'))"
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This is typical.  You have a fine-grain sampling period that you want
> to
> > > “tumble”, and a broader window that you want to “slide” or “roll” along
> > the
> > > underlying smaller sampling periods.
> > >
> > > The above pair of profiles is clear enough, but not real efficient.
> > Also,
> > > you can’t guarantee they’ll stay exactly synchronized.  It would be
> > better
> > > to combine them somehow.
> > >
> > >
> > >
> > > In these examples, we only need to look back 5 sampling periods, so the
> > > following would work (NOTE: requires Casey’s PR#668 that preserves
> state
> > > between sampling periods):
> > >
> > >
> > >
> > > (SOLUTION B)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "windowed_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> > >
> > >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> > >
> > >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> > >
> > >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> > >
> > >         "s0": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> > >
> > >         },
> > >
> > >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This works fine for a sliding window of size 5 sample periods.  But
> what
> > > if it was 15?  Or 60?
> > >
> > > No doubt we could implement a ring buffer with a stellar list object,
> but
> > > it’s gonna get ugly(er) and slow.
> > >
> > >
> > >
> > > A much more elegant solution would be to allow “result” to write two
> > > objects.  The result could be viewed either as two Profiles, or as a
> > > Profile with multiple addressable columns of hbase output.  For
> instance:
> > >
> > > (SOLUTION C)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": {
> > >
> > >         "sketchy_mad": "s",
> > >
> > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 4, 'MINUTES'))"
> > >
> > >         }
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This is clean and tight, and I think will be easily generalized to any
> > > Sliding Window scenario.
> > >
> > > BTW, given that each write has to work it’s way through more processing
> > > before getting to HBase, it’s best to say that all writes happen AFTER
> > all
> > > the “result” processing, which is why I showed the MERGE above
> retrieving
> > > only 4 sampling periods prior to the current one, and then add the
> > current
> > > one from local state.
> > >
> > >
> > >
> > > What do you think?  Can anyone specify a better way to do Sliding
> Window
> > > profiles correctly with current functionality?
> > >
> > >
> > >
> > >
> > >
> > > Casey noted that if this feature were available, he would like to
> > consider
> > > separating the value distribution from the deviation distribution, in
> the
> > > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > > OUTLIER_MAD_INIT() function needs to take a starting median, as
> argument.
> > > So the final Sliding Window Profile for OUTLIER_MAD state tracking
> might
> > > look more like:
> > >
> > >
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "median": "STATS_PERCENTILE(STATS_MERGE(
> > PROFILE_GET('sketchy_values',
> > > 'global', 5, 'MINUTES')), 50)",
> > >
> > >         "s": "OUTLIER_MAD_INIT(median)",
> > >
> > >         "val_dist": "STATS_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "val_dist": "STATS_ADD(val_dist, value)",
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": {
> > >
> > >         "sketchy_values": "val_dist",
> > >
> > >         "sketchy_mad": "s",
> > >
> > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 4, 'MINUTES'))"
> > >
> > >         }
> > >
> > >     },
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > Thanks,
> > >
> > > --Matt
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Nick Allen <ni...@nickallen.org>
> >
>

Re: [DISCUSS] How to do Sliding Windows in Profiler

Posted by Casey Stella <ce...@gmail.com>.

So, I think the key takeaway, at least for me, and my key feedback is that:

   - We should separate the wrong behavior of MAD with the ability to
   improve windowing.  We can fix one without the other and, in my opinion,
   the key fix for MAD is keeping the value distribution independent of the
   MAD state, as Matt noted at the end of this email.
   - Because the above necessitates tracking two things, it'd be nice to
   adopt Matt's suggestion to be able to sort of merge profiles track two
   results.

One more thing, in general, if you ever want to merge outputs on read, then
you should never write out the merged data (unless the data you're writing
has a way to untangle the overlap).  They are two separate semantics.

On merge on read, you choose the window at read time.  An example of this
for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function which
takes the value distribution would be:
{
  "profiles": [
    {
      "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
      "foreach": "'global'",
      "init" : {
        "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
'global', 5, 'MINUTES')))",
        "val_dist": "STATS_INIT()"
        },
      "update": {
        "val_dist": "STATS_ADD(val_dist, value)",
        "s": "OUTLIER_MAD_ADD(s, value)"
        },
      "result": {
        "sketchy_values": "val_dist",
        "sketchy_mad": "s",
        }
    }
  ]
}

You would then read this in the enrichment topology or wherever
via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
'global', 5, 'MINUTES')), value)


On merge on write, you choose the window size on write time and should read
only the latest profile entry on read, so that would look like:
{
  "profiles": [
    {
      "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
      "foreach": "'global'",
      "init" : {
        "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
'global', 5, 'MINUTES')))",
        "val_dist": "STATS_INIT()"
        },
      "update": {
        "val_dist": "STATS_ADD(val_dist, value)",
        "s": "OUTLIER_MAD_ADD(s, value)"
        },
      "result": {
        "sketchy_values": "val_dist",
        "sketchy_mad": "s",
        "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
        }
    }
  ]
}


You would then score pulling only the most recent MAD state.  Merging
across them would double count entries, as Matt mentioned above
OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
'MINUTES')), value)

So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
multiple outputs.  Also, different but coupled problem, MAD needs fixing to
decouple tracking value distributions from the distribution of the
deviation from the median.

Casey

On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <ni...@nickallen.org> wrote:

> That was a lot to digest so I apologize if I have missed some of your
> thought process.  I probably need to read through this a couple more times.
>
>
> Can anyone specify a better way to do Sliding Window profiles correctly
> > with current functionality?
>
>
> Could we not change the underlying implementation of the profiler to
> support sliding windows?  Ideally the same profile definition could be
> applied to either a tumbling or sliding window, rather than requiring a
> change in the profile definition itself.
>
> The work that I had done for METRON-590, made it possible to configure
> either a tumbling or sliding window.  The downside of that work is that it
> was an all-or-nothing change, all profiles (in the same topology) were
> either tumbling or sliding.  But perhaps we can enhance it from there.
>
>
> A much more elegant solution would be to allow “result” to write two
> > objects.
>
>
> Yes, I like this! I have had the same thought myself.  This would be a
> simple change and would provide the user with some flexibility.  Does this
> change solve the entire problem though?
>
>
> Obviously, I do need to sit down and think on this a bit more.  I really
> wish we could handle these use cases with a profile definition that remains
> simple and easy to grok.  Thanks for spending the time to think through
> this.
>
> On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <ma...@apache.org> wrote:
>
> > Hi all,
> >
> > Casey and I had an interesting chat yesterday, during which we agreed
> that
> > the example code for Outlier Analysis in https://github.com/apache/
> > incubator-metron/blob/master/metron-analytics/metron-
> statistics/README.md
> > and the revised example code in https://issues.apache.org/
> > jira/browse/METRON-668 (as of 23 January) both do not correctly implement
> > the desired Sliding Window model.  This email gives the argument for why,
> > and proposes a couple ways to do it right.  Your input and preferences
> are
> > requested.
> >
> >
> >
> > First a couple statements about the STATS object that underlies most
> > interesting Profile work:
> >
> > ·         The STATS object is a t-digest.  It summarizes a set of data
> > points, such as those received during a sampling period, in a way that is
> > nominally O(1) regardless of the input number of data points, and
> preserves
> > the info about the “shape” of the statistical distribution of those
> > points.  Not only info about averages and standard deviations, but also
> > about medians and percentiles (which, btw, is a very different kind of
> > information), is preserved and can be calculated correctly to within a
> > given error epsilon.  Since it is a summary, however, time information is
> > lost.
> >
> > ·         STATS objects, these digests of sampling periods, are
> MERGEABLE,
> > meaning if you have a digest from time(1) to time(2), and another digest
> > from time(2) to time (3), you can merge them and get a digest that is
> > statistically equivalent to a digest from time(1) to time(3)
> continuously.
> >
> > ·         They are sort of idempotent, in that if you take two identical
> > digests and merge them, you get almost the same object.  However, the
> > result object will be scaled as summarizing twice the number of input
> data
> > points.
> >
> > ·         Which is why it DOESN’T work to try to merge overlapping
> > sampling periods.  To give a crude example, if you have a digest from
> > time(1) to time(3) and another digest from time(2) to time(4), and merge
> > them, the samples from time(2) to time(3) will be over-represented by a
> > factor of 2x, which should be expected to skew the distribution (unless
> the
> > distribution really is constant for all sub-windows – which would mean we
> > don’t need Sliding Windows because nothing changes).
> >
> >
> >
> > The Outlier Analysis profiles linked above try to implement a sliding
> > window, in which each profile period summarizes the Median Absolute
> > Deviation distribution of the last five profile periods only.  An
> “Outlier
> > MAD Score” can then be determined by comparing the deviation of a new
> data
> > point to the last MAD distribution recorded in the Profile.  This allows
> > for changes over time in the statistical distribution of inputs, but does
> > not make the measure unduly sensitive to just the last minute or two.
> This
> > is a typical use case for Sliding Windows.
> >
> >
> >
> > Both example codes trip on how to do the sliding window in the context of
> > a Profile.  At sampling period boundaries, both either compose the
> “result”
> > digest or initialize the “next” digest by reading the previous 5 result
> > digests.  That is wrong, because it ignores the fact that those digests
> > aren’t just for their time periods.  They too were composed with THEIR
> > preceding 5 digests, each of which were composed with their preceding 5
> > digests, which in turn… etc.  The end result is sort of like the way
> > Madieras or some brandies are aged via the Solera process with fractional
> > blending.  You don’t get a true sliding window, which sharply drops the
> > past, you get a continuous dilution of the past. In fact, it’s wrong to
> > assume that the “tail” of the far past is more diluted than the near
> past!
> > – It would be with some algorithms, but the alg used in these two
> examples
> > causes the far past to become an exponentially MORE important fraction of
> > the overall data than the near past – much worse than simply turning on
> > digesting at time(0) and leaving it on, with no attempt at windowing.
> > (Simulate it in a spreadsheet and you’ll see.)
> >
> >
> >
> > We need a Profiler structure that assists in creating Sliding Window
> > profiles.  The problem is that Profiles let you save only one thing
> > (quantity or object) per sampling period, and that’s typically a
> different
> > “thing” (object type or scale) than you want to use to compose the result
> > for each windowed span.  One way to do it correctly would be with two
> > Profiles, like this:
> >
> >
> >
> > (SOLUTION A)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "sketchy_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": "s"
> >
> >     },
> >
> >     {
> >
> >       "profile": "windowed_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : { },
> >
> >       "update": { },
> >
> >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> >
> > 'global', 5, 'MINUTES'))"
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This is typical.  You have a fine-grain sampling period that you want to
> > “tumble”, and a broader window that you want to “slide” or “roll” along
> the
> > underlying smaller sampling periods.
> >
> > The above pair of profiles is clear enough, but not real efficient.
> Also,
> > you can’t guarantee they’ll stay exactly synchronized.  It would be
> better
> > to combine them somehow.
> >
> >
> >
> > In these examples, we only need to look back 5 sampling periods, so the
> > following would work (NOTE: requires Casey’s PR#668 that preserves state
> > between sampling periods):
> >
> >
> >
> > (SOLUTION B)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "windowed_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> >
> >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> >
> >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> >
> >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> >
> >         "s0": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> >
> >         },
> >
> >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This works fine for a sliding window of size 5 sample periods.  But what
> > if it was 15?  Or 60?
> >
> > No doubt we could implement a ring buffer with a stellar list object, but
> > it’s gonna get ugly(er) and slow.
> >
> >
> >
> > A much more elegant solution would be to allow “result” to write two
> > objects.  The result could be viewed either as two Profiles, or as a
> > Profile with multiple addressable columns of hbase output.  For instance:
> >
> > (SOLUTION C)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": {
> >
> >         "sketchy_mad": "s",
> >
> >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad',
> >
> > 'global', 4, 'MINUTES'))"
> >
> >         }
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This is clean and tight, and I think will be easily generalized to any
> > Sliding Window scenario.
> >
> > BTW, given that each write has to work it’s way through more processing
> > before getting to HBase, it’s best to say that all writes happen AFTER
> all
> > the “result” processing, which is why I showed the MERGE above retrieving
> > only 4 sampling periods prior to the current one, and then add the
> current
> > one from local state.
> >
> >
> >
> > What do you think?  Can anyone specify a better way to do Sliding Window
> > profiles correctly with current functionality?
> >
> >
> >
> >
> >
> > Casey noted that if this feature were available, he would like to
> consider
> > separating the value distribution from the deviation distribution, in the
> > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > OUTLIER_MAD_INIT() function needs to take a starting median, as argument.
> > So the final Sliding Window Profile for OUTLIER_MAD state tracking might
> > look more like:
> >
> >
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "median": "STATS_PERCENTILE(STATS_MERGE(
> PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')), 50)",
> >
> >         "s": "OUTLIER_MAD_INIT(median)",
> >
> >         "val_dist": "STATS_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": {
> >
> >         "sketchy_values": "val_dist",
> >
> >         "sketchy_mad": "s",
> >
> >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad',
> >
> > 'global', 4, 'MINUTES'))"
> >
> >         }
> >
> >     },
> >
> >   ]
> >
> > }
> >
> >
> >
> > Thanks,
> >
> > --Matt
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> --
> Nick Allen <ni...@nickallen.org>
>

Re: [DISCUSS] How to do Sliding Windows in Profiler

Posted by Nick Allen <ni...@nickallen.org>.

That was a lot to digest so I apologize if I have missed some of your
thought process.  I probably need to read through this a couple more times.


Can anyone specify a better way to do Sliding Window profiles correctly
> with current functionality?


Could we not change the underlying implementation of the profiler to
support sliding windows?  Ideally the same profile definition could be
applied to either a tumbling or sliding window, rather than requiring a
change in the profile definition itself.

The work that I had done for METRON-590, made it possible to configure
either a tumbling or sliding window.  The downside of that work is that it
was an all-or-nothing change, all profiles (in the same topology) were
either tumbling or sliding.  But perhaps we can enhance it from there.


A much more elegant solution would be to allow “result” to write two
> objects.


Yes, I like this! I have had the same thought myself.  This would be a
simple change and would provide the user with some flexibility.  Does this
change solve the entire problem though?


Obviously, I do need to sit down and think on this a bit more.  I really
wish we could handle these use cases with a profile definition that remains
simple and easy to grok.  Thanks for spending the time to think through
this.

On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <ma...@apache.org> wrote:

> Hi all,
>
> Casey and I had an interesting chat yesterday, during which we agreed that
> the example code for Outlier Analysis in https://github.com/apache/
> incubator-metron/blob/master/metron-analytics/metron-statistics/README.md
> and the revised example code in https://issues.apache.org/
> jira/browse/METRON-668 (as of 23 January) both do not correctly implement
> the desired Sliding Window model.  This email gives the argument for why,
> and proposes a couple ways to do it right.  Your input and preferences are
> requested.
>
>
>
> First a couple statements about the STATS object that underlies most
> interesting Profile work:
>
> ·         The STATS object is a t-digest.  It summarizes a set of data
> points, such as those received during a sampling period, in a way that is
> nominally O(1) regardless of the input number of data points, and preserves
> the info about the “shape” of the statistical distribution of those
> points.  Not only info about averages and standard deviations, but also
> about medians and percentiles (which, btw, is a very different kind of
> information), is preserved and can be calculated correctly to within a
> given error epsilon.  Since it is a summary, however, time information is
> lost.
>
> ·         STATS objects, these digests of sampling periods, are MERGEABLE,
> meaning if you have a digest from time(1) to time(2), and another digest
> from time(2) to time (3), you can merge them and get a digest that is
> statistically equivalent to a digest from time(1) to time(3) continuously.
>
> ·         They are sort of idempotent, in that if you take two identical
> digests and merge them, you get almost the same object.  However, the
> result object will be scaled as summarizing twice the number of input data
> points.
>
> ·         Which is why it DOESN’T work to try to merge overlapping
> sampling periods.  To give a crude example, if you have a digest from
> time(1) to time(3) and another digest from time(2) to time(4), and merge
> them, the samples from time(2) to time(3) will be over-represented by a
> factor of 2x, which should be expected to skew the distribution (unless the
> distribution really is constant for all sub-windows – which would mean we
> don’t need Sliding Windows because nothing changes).
>
>
>
> The Outlier Analysis profiles linked above try to implement a sliding
> window, in which each profile period summarizes the Median Absolute
> Deviation distribution of the last five profile periods only.  An “Outlier
> MAD Score” can then be determined by comparing the deviation of a new data
> point to the last MAD distribution recorded in the Profile.  This allows
> for changes over time in the statistical distribution of inputs, but does
> not make the measure unduly sensitive to just the last minute or two.  This
> is a typical use case for Sliding Windows.
>
>
>
> Both example codes trip on how to do the sliding window in the context of
> a Profile.  At sampling period boundaries, both either compose the “result”
> digest or initialize the “next” digest by reading the previous 5 result
> digests.  That is wrong, because it ignores the fact that those digests
> aren’t just for their time periods.  They too were composed with THEIR
> preceding 5 digests, each of which were composed with their preceding 5
> digests, which in turn… etc.  The end result is sort of like the way
> Madieras or some brandies are aged via the Solera process with fractional
> blending.  You don’t get a true sliding window, which sharply drops the
> past, you get a continuous dilution of the past. In fact, it’s wrong to
> assume that the “tail” of the far past is more diluted than the near past!
> – It would be with some algorithms, but the alg used in these two examples
> causes the far past to become an exponentially MORE important fraction of
> the overall data than the near past – much worse than simply turning on
> digesting at time(0) and leaving it on, with no attempt at windowing.
> (Simulate it in a spreadsheet and you’ll see.)
>
>
>
> We need a Profiler structure that assists in creating Sliding Window
> profiles.  The problem is that Profiles let you save only one thing
> (quantity or object) per sampling period, and that’s typically a different
> “thing” (object type or scale) than you want to use to compose the result
> for each windowed span.  One way to do it correctly would be with two
> Profiles, like this:
>
>
>
> (SOLUTION A)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "sketchy_mad",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": "s"
>
>     },
>
>     {
>
>       "profile": "windowed_mad",
>
>       "foreach": "'global'",
>
>       "init" : { },
>
>       "update": { },
>
>       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
>
> 'global', 5, 'MINUTES'))"
>
>     }
>
>   ]
>
> }
>
>
>
> This is typical.  You have a fine-grain sampling period that you want to
> “tumble”, and a broader window that you want to “slide” or “roll” along the
> underlying smaller sampling periods.
>
> The above pair of profiles is clear enough, but not real efficient.  Also,
> you can’t guarantee they’ll stay exactly synchronized.  It would be better
> to combine them somehow.
>
>
>
> In these examples, we only need to look back 5 sampling periods, so the
> following would work (NOTE: requires Casey’s PR#668 that preserves state
> between sampling periods):
>
>
>
> (SOLUTION B)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "windowed_mad",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
>
>         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
>
>         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
>
>        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
>
>         "s0": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s0": "OUTLIER_MAD_ADD(s0, value)"
>
>         },
>
>       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
>
>     }
>
>   ]
>
> }
>
>
>
> This works fine for a sliding window of size 5 sample periods.  But what
> if it was 15?  Or 60?
>
> No doubt we could implement a ring buffer with a stellar list object, but
> it’s gonna get ugly(er) and slow.
>
>
>
> A much more elegant solution would be to allow “result” to write two
> objects.  The result could be viewed either as two Profiles, or as a
> Profile with multiple addressable columns of hbase output.  For instance:
>
> (SOLUTION C)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": {
>
>         "sketchy_mad": "s",
>
>         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad',
>
> 'global', 4, 'MINUTES'))"
>
>         }
>
>     }
>
>   ]
>
> }
>
>
>
> This is clean and tight, and I think will be easily generalized to any
> Sliding Window scenario.
>
> BTW, given that each write has to work it’s way through more processing
> before getting to HBase, it’s best to say that all writes happen AFTER all
> the “result” processing, which is why I showed the MERGE above retrieving
> only 4 sampling periods prior to the current one, and then add the current
> one from local state.
>
>
>
> What do you think?  Can anyone specify a better way to do Sliding Window
> profiles correctly with current functionality?
>
>
>
>
>
> Casey noted that if this feature were available, he would like to consider
> separating the value distribution from the deviation distribution, in the
> OUTLIER_MAD state objects.  He also notes that the hypothetical
> OUTLIER_MAD_INIT() function needs to take a starting median, as argument.
> So the final Sliding Window Profile for OUTLIER_MAD state tracking might
> look more like:
>
>
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "median": "STATS_PERCENTILE(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')), 50)",
>
>         "s": "OUTLIER_MAD_INIT(median)",
>
>         "val_dist": "STATS_INIT()"
>
>         },
>
>       "update": {
>
>         "val_dist": "STATS_ADD(val_dist, value)",
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": {
>
>         "sketchy_values": "val_dist",
>
>         "sketchy_mad": "s",
>
>         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad',
>
> 'global', 4, 'MINUTES'))"
>
>         }
>
>     },
>
>   ]
>
> }
>
>
>
> Thanks,
>
> --Matt
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


-- 
Nick Allen <ni...@nickallen.org>