Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/05/22 00:03:28 UTC

OnlineSummarizer mods

How would one change OnlineSummarizer to allow setting the 1 and 3
quartiles? They are hard-coded at 25% and 75%.

Q1 and Q3 at 2%/98% give a way to ignore outliers.

-- 
Lance Norskog
goksron@gmail.com

Re: OnlineSummarizer mods

Posted by Lance Norskog <go...@gmail.com>.
Thanks!

On Sun, May 22, 2011 at 12:45 AM, Ted Dunning <te...@gmail.com> wrote:
> The simplest implementation for this would be to keep a tree of values.
>  While you have less than N (about 1000) values in the tree, you add each
> incoming value to the tree.  When you have more than N values after adding a
> new sample, pick values at random to delete until you are down to N values
> in the tree.  Any time you have a value that is in the top or bottom 2% you
> can log it.  This will get very nearly the result you want at the cost of a
> bit of work for the insertion and deletion.
>
> The data structure to use is a little tricky.  You want to be able to
> efficiently insert an element, of course, but you also want to find the n-th
> largest element quickly and to determine what the rank of a newly inserted
> element is reasonably quickly.  For most applications with time scales on
> the order of milliseconds, simply keeping a sorted array will be a fine
> solution.  This obviously costs a fair bit to shift things when you do an
> insertion.  One fun trick to improve this by a substantial factor is to size
> the array larger than necessary and keep "tombstone" markers in the array
> wherever you have deleted an element.  Then when shifting to make room for
> a new element, you will have to shift data only until you find a tombstone
> or the end of the array and you won't have to shift at all on deletion.
>  Finding the current rank or finding the n-th largest element will require a
> scan, but because of caching and fast processors, this will probably be
> nearly as fast as any fancy data structure you can build especially since
> you will only be scanning from one end or the other.
>
> This won't be terribly accurate, nor will it be as tight on memory, but it
> will be very simple to get right and for the slow query logging problem, it
> should be plenty good enough.
>
> On Sun, May 22, 2011 at 12:08 AM, Lance Norskog <go...@gmail.com> wrote:
>
>> Here is the use case: quantiles for search query times. 2% gives the
>> fastest possible queries, and 98% gives the slowest normal queries and
>> ignores a garbage collection colliding with a cosmic ray.
>>
>> The next step is to maintain a reservoir sample of search queries, and
>> calculate the quantiles on demand. This allows drilling down into
>> "what queries are consistently slow?". This definitely wants the
>> online algorithm, because it allows two-pass classification: the first
>> pass through the queries establishes the quantiles, and the second
>> pass classifies the query time among the quantiles. This would be very
>> useful.
>>
>> Lance
>>
>> On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > This is not the same algorithm at all.  In the first place, there is a
>> > requirement that you know the size of your data-set in advance so that
>> you
>> > can set \tau.  It might be possible to adaptively set \tau as you go, but
>> > that isn't obvious to me in the few minutes I have looked at this.  My
>> worry
>> > is that if you do this, you may have already thrown away important data
>> by
>> > the time that you realize that \tau needs to be larger than you expected.
>> >
>> > Secondly, this appears to be most useful for computing moderately extreme
>> > quantiles.  That may be what you intend to do.  Or not.
>> >
>> > I think that the other algorithm is better in that it requires much less
>> > memory and is completely on-line.
>> >
>> > The basic idea in the on-line algorithm is that you maintain an estimate
>> of
>> > the quantile that you want and an estimate of the PDF at that point.  You
>> > estimate the PDF by keeping a count of the points that you have seen
>> within
>> > bounds that you tighten over time.  The current estimate is adjusted up or
>> > down by an amount based on the quantile that you are estimating times the
>> > current PDF estimate.  For each quantile that you are estimating, you
>> only
>> > need to keep a few values.  In the case of the current Mahout
>> > implementation, I re-use data stored for one quantile for a different
>> > purpose in the estimation of other quantiles so that we pretty much just
>> > need to keep the 5 current quartile estimates and possibly a total count.
>> >  This compares to keeping a thousand values plus counts per quantile with
>> > the IBM paper you reference.
>> >
>> >
>> > On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <go...@gmail.com>
>> wrote:
>> >
>> >> This seems to be the same thing, from '95. And simple enough that even
>> >> I can follow it.
>> >>
>> >>
>> >>
>> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>> >>
>> >> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <te...@gmail.com>
>> >> wrote:
>> >> > By definition, quaRtiles are quartiles.
>> >> >
>> >> > You can adapt the code to generate other quaNtiles by adjusting the
>> >> > parameters that adjust the estimates up and down.  Right now, I use
>> the
>> >> > inter-quartile range to estimate the probability density.  You would
>> need
>> >> to
>> >> > do something else.
>> >> >
>> >> > The original paper has some help in this regard.  See the javadoc for
>> the
>> >> > link.
>> >> >
>> >> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> How would one change OnlineSummarizer to allow setting the 1 and 3
>> >> >> quartiles? They are hard-coded at 25% and 75%.
>> >> >>
>> >> >> Q1 and Q3 at 2%/98% give a way to ignore outliers.
>> >> >>
>> >> >> --
>> >> >> Lance Norskog
>> >> >> goksron@gmail.com
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: OnlineSummarizer mods

Posted by Ted Dunning <te...@gmail.com>.
The simplest implementation for this would be to keep a tree of values.
 While you have less than N (about 1000) values in the tree, you add each
incoming value to the tree.  When you have more than N values after adding a
new sample, pick values at random to delete until you are down to N values
in the tree.  Any time you have a value that is in the top or bottom 2% you
can log it.  This will get very nearly the result you want at the cost of a
bit of work for the insertion and deletion.
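
In rough Java terms, a sketch of that bookkeeping might look like the
following (the class and method names are made up for illustration; none of
this is existing Mahout code):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;

    // Sketch only: keep about MAX values, evict one at random once the bound
    // is exceeded, and report whether a new value lands in the top or bottom 2%.
    public class BoundedSample {
      private static final int MAX = 1000;
      private final TreeMap<Double, Integer> counts = new TreeMap<>();
      private final Random random = new Random();
      private int size = 0;

      public boolean addAndCheckExtreme(double x) {
        // rank of x from the low end, before inserting it
        int rank = 0;
        for (int c : counts.headMap(x).values()) {
          rank += c;
        }
        boolean extreme = size >= 100 && (rank <= 0.02 * size || rank >= 0.98 * size);

        counts.merge(x, 1, Integer::sum);
        size++;
        while (size > MAX) {          // random eviction keeps a roughly uniform sample
          removeRandom();
        }
        return extreme;               // caller logs the query when this is true
      }

      private void removeRandom() {
        int k = random.nextInt(size);
        Iterator<Map.Entry<Double, Integer>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<Double, Integer> e = it.next();
          k -= e.getValue();
          if (k < 0) {
            if (e.getValue() == 1) {
              it.remove();
            } else {
              e.setValue(e.getValue() - 1);
            }
            size--;
            return;
          }
        }
      }
    }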

The data structure to use is a little tricky.  You want to be able to
efficiently insert an element, of course, but you also want to find the n-th
largest element quickly and to determine what the rank of a newly inserted
element is reasonably quickly.  For most applications with time scales on
the order of milliseconds, simply keeping a sorted array will be a fine
solution.  This obviously costs a fair bit to shift things when you do an
insertion.  One fun trick to improve this by a substantial factor is to size
the array larger than necessary and keep "tombstone" markers in the array
wherever you have deleted an element.  Then when shifting to make room for
a new element, you will have to shift data only until you find a tombstone
or the end of the array and you won't have to shift at all on deletion.
 Finding the current rank or finding the n-th largest element will require a
scan, but because of caching and fast processors, this will probably be
nearly as fast as any fancy data structure you can build especially since
you will only be scanning from one end or the other.
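
As a sketch of the tombstone trick (again, only illustrating the shifting
idea with made-up names, not anything in Mahout):

    import java.util.Arrays;

    // Deleted slots are marked with NaN; a delete never shifts anything, and
    // an insert shifts elements only as far as the nearest tombstone on
    // either side of the insertion point.
    public class TombstoneArray {
      private final double[] slots;

      public TombstoneArray(int capacity) {
        slots = new double[capacity];
        Arrays.fill(slots, Double.NaN);   // NaN marks an empty or deleted slot
      }

      public void insert(double x) {
        // pos = index of the first live entry that is >= x (tombstones skipped)
        int pos = 0;
        while (pos < slots.length && (Double.isNaN(slots[pos]) || slots[pos] < x)) {
          pos++;
        }
        // nearest tombstone at or after pos, and nearest tombstone before pos
        int right = pos;
        while (right < slots.length && !Double.isNaN(slots[right])) {
          right++;
        }
        int left = pos - 1;
        while (left >= 0 && !Double.isNaN(slots[left])) {
          left--;
        }
        if (right < slots.length && (left < 0 || right - pos <= pos - 1 - left)) {
          // shift the run [pos, right) one slot right, then drop x into pos
          System.arraycopy(slots, pos, slots, pos + 1, right - pos);
          slots[pos] = x;
        } else if (left >= 0) {
          // shift the run (left, pos) one slot left, then drop x into pos - 1
          System.arraycopy(slots, left + 1, slots, left, pos - 1 - left);
          slots[pos - 1] = x;
        } else {
          throw new IllegalStateException("no free slot left");
        }
      }

      public void delete(int index) {
        slots[index] = Double.NaN;        // deletion just drops a tombstone
      }

      // rank of x = number of live entries smaller than x; a plain linear scan
      public int rank(double x) {
        int r = 0;
        for (double v : slots) {
          if (!Double.isNaN(v) && v < x) {
            r++;
          }
        }
        return r;
      }
    }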

This won't be terribly accurate, nor will it be as tight on memory, but it
will be very simple to get right and for the slow query logging problem, it
should be plenty good enough.

On Sun, May 22, 2011 at 12:08 AM, Lance Norskog <go...@gmail.com> wrote:

> Here is the use case: quantiles for search query times. 2% gives the
> fastest possible queries, and 98% gives the slowest normal queries and
> ignores a garbage collection colliding with a cosmic ray.
>
> The next step is to maintain a reservoir sample of search queries, and
> calculate the quantiles on demand. This allows drilling down into
> "what queries are consistently slow?". This definitely wants the
> online algorithm, because it allows two-pass classification: the first
> pass through the queries establishes the quantiles, and the second
> pass classifies the query time among the quantiles. This would be very
> useful.
>
> Lance
>
> On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > This is not the same algorithm at all.  In the first place, there is a
> > requirement that you know the size of your data-set in advance so that
> you
> > can set \tau.  It might be possible to adaptively set \tau as you go, but
> > that isn't obvious to me in the few minutes I have looked at this.  My
> worry
> > is that if you do this, you may have already thrown away important data
> by
> > the time that you realize that \tau needs to be larger than you expected.
> >
> > Secondly, this appears to be most useful for computing moderately extreme
> > quantiles.  That may be what you intend to do.  Or not.
> >
> > I think that the other algorithm is better in that it requires much less
> > memory and is completely on-line.
> >
> > The basic idea in the on-line algorithm is that you maintain an estimate
> of
> > the quantile that you want and an estimate of the PDF at that point.  You
> > estimate the PDF by keeping a count of the points that you have seen
> within
> > bounds that you tighten over time.  The current estimate is adjusted up or
> > down by an amount based on the quantile that you are estimating times the
> > current PDF estimate.  For each quantile that you are estimating, you
> only
> > need to keep a few values.  In the case of the current Mahout
> > implementation, I re-use data stored for one quantile for a different
> > purpose in the estimation of other quantiles so that we pretty much just
> > need to keep the 5 current quartile estimates and possibly a total count.
> >  This compares to keeping a thousand values plus counts per quantile with
> > the IBM paper you reference.
> >
> >
> > On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> This seems to be the same thing, from '95. And simple enough that even
> >> I can follow it.
> >>
> >>
> >>
> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
> >>
> >> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> > By definition, quaRtiles are quartiles.
> >> >
> >> > You can adapt the code to generate other quaNtiles by adjusting the
> >> > parameters that adjust the estimates up and down.  Right now, I use
> the
> >> > inter-quartile range to estimate the probability density.  You would
> need
> >> to
> >> > do something else.
> >> >
> >> > The original paper has some help in this regard.  See the javadoc for
> the
> >> > link.
> >> >
> >> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com>
> >> wrote:
> >> >
> >> >> How would one change OnlineSummarizer to allow setting the 1 and 3
> >> >> quartiles? They are hard-coded at 25% and 75%.
> >> >>
> >> >> Q1 and Q3 at 2%/98% give a way to ignore outliers.
> >> >>
> >> >> --
> >> >> Lance Norskog
> >> >> goksron@gmail.com
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: OnlineSummarizer mods

Posted by Lance Norskog <go...@gmail.com>.
Here is the use case: quantiles for search query times. 2% gives the
fastest possible queries, and 98% gives the slowest normal queries and
ignores a garbage collection colliding with a cosmic ray.

The next step is to maintain a reservoir sample of search queries, and
calculate the quantiles on demand. This allows drilling down into
"what queries are consistently slow?". This definitely wants the
online algorithm, because it allows two-pass classification: the first
pass through the queries establishes the quantiles, and the second
pass classifies the query time among the quantiles. This would be very
useful.
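
The second pass is trivial once the first pass has produced the quantile
boundaries. A hypothetical sketch (none of these names exist in Mahout, they
only illustrate the idea):

    // Pass one yields sorted quantile estimates, e.g. the 2%, 25%, 50%, 75%
    // and 98% points; pass two buckets each query time against them.
    public class QueryTimeClassifier {
      private final double[] boundaries;   // sorted quantile estimates from pass one

      public QueryTimeClassifier(double[] boundaries) {
        this.boundaries = boundaries;
      }

      public int classify(double queryMillis) {
        int bucket = 0;
        while (bucket < boundaries.length && queryMillis > boundaries[bucket]) {
          bucket++;
        }
        return bucket;                     // 0 = fastest band, boundaries.length = slowest
      }
    }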

Lance

On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <te...@gmail.com> wrote:
> This is not the same algorithm at all.  In the first place, there is a
> requirement that you know the size of your data-set in advance so that you
> can set \tau.  It might be possible to adaptively set \tau as you go, but
> that isn't obvious to me in the few minutes I have looked at this.  My worry
> is that if you do this, you may have already thrown away important data by
> the time that you realize that \tau needs to be larger than you expected.
>
> Secondly, this appears to be most useful for computing moderately extreme
> quantiles.  That may be what you intend to do.  Or not.
>
> I think that the other algorithm is better in that it requires much less
> memory and is completely on-line.
>
> The basic idea in the on-line algorithm is that you maintain an estimate of
> the quantile that you want and an estimate of the PDF at that point.  You
> estimate the PDF by keeping a count of the points that you have seen within
> bounds that you tighten over time.  The current estimate is adjusted up or
> down by an amount based on the quantile that you are estimating times the
> current PDF estimate.  For each quantile that you are estimating, you only
> need to keep a few values.  In the case of the current Mahout
> implementation, I re-use data stored for one quantile for a different
> purpose in the estimation of other quantiles so that we pretty much just
> need to keep the 5 current quartile estimates and possibly a total count.
>  This compares to keeping a thousand values plus counts per quantile with
> the IBM paper you reference.
>
>
> On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> This seems to be the same thing, from '95. And simple enough that even
>> I can follow it.
>>
>>
>> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>>
>> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > By definition, quaRtiles are quartiles.
>> >
>> > You can adapt the code to generate other quaNtiles by adjusting the
>> > parameters that adjust the estimates up and down.  Right now, I use the
>> > inter-quartile range to estimate the probability density.  You would need
>> to
>> > do something else.
>> >
>> > The original paper has some help in this regard.  See the javadoc for the
>> > link.
>> >
>> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com>
>> wrote:
>> >
>> >> How would one change OnlineSummarizer to allow setting the 1 and 3
>> >> quartiles? They are hard-coded at 25% and 75%.
>> >>
>> >> Q1 and Q3 at 2%/98% give a way to ignore outliers.
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: OnlineSummarizer mods

Posted by Ted Dunning <te...@gmail.com>.
This is not the same algorithm at all.  In the first place, there is a
requirement that you know the size of your data-set in advance so that you
can set \tau.  It might be possible to adaptively set \tau as you go, but
that isn't obvious to me in the few minutes I have looked at this.  My worry
is that if you do this, you may have already thrown away important data by
the time that you realize that \tau needs to be larger than you expected.

Secondly, this appears to be most useful for computing moderately extreme
quantiles.  That may be what you intend to do.  Or not.

I think that the other algorithm is better in that it requires much less
memory and is completely on-line.

The basic idea in the on-line algorithm is that you maintain an estimate of
the quantile that you want and an estimate of the PDF at that point.  You
estimate the PDF by keeping a count of the points that you have seen within
bounds that you tighten over time.  The current estimate is adjusted up or
down by an amount based on the quantile that you are estimating times the
current PDF estimate.  For each quantile that you are estimating, you only
need to keep a few values.  In the case of the current Mahout
implementation, I re-use data stored for one quantile for a different
purpose in the estimation of other quantiles so that we pretty much just
need to keep the 5 current quartile estimates and possibly a total count.
 This compares to keeping a thousand values plus counts per quantile with
the IBM paper you reference.
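
The update rule is roughly the following sketch. This is not the actual
OnlineSummarizer code; in particular the step-size handling here is a crude
stand-in for the PDF-based scaling described above:

    // Nudge the estimate up or down on every sample, with the two step
    // weights chosen so the estimate settles where a fraction q of the data
    // lies below it.
    public class IncrementalQuantile {
      private final double q;          // target quantile, e.g. 0.98
      private double estimate;         // current quantile estimate
      private double step = 1.0;       // crude stand-in for 1 / pdf(estimate)
      private long n = 0;

      public IncrementalQuantile(double q) {
        this.q = q;
      }

      public void add(double x) {
        n++;
        if (n == 1) {
          estimate = x;
          return;
        }
        if (x > estimate) {
          estimate += step * q;        // move up with weight q
        } else if (x < estimate) {
          estimate -= step * (1 - q);  // move down with weight 1 - q
        }
        // A real implementation tightens the step with an evolving density
        // estimate (the Mahout code uses the inter-quartile range); a fixed
        // decay stands in here.
        step = Math.max(1e-9, step * 0.999);
      }

      public double get() {
        return estimate;
      }
    }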


On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <go...@gmail.com> wrote:

> This seems to be the same thing, from '95. And simple enough that even
> I can follow it.
>
>
> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>
> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > By definition, quaRtiles are quartiles.
> >
> > You can adapt the code to generate other quaNtiles by adjusting the
> > parameters that adjust the estimates up and down.  Right now, I use the
> > inter-quartile range to estimate the probability density.  You would need
> to
> > do something else.
> >
> > The original paper has some help in this regard.  See the javadoc for the
> > link.
> >
> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> How would one change OnlineSummarizer to allow setting the 1 and 3
> >> quartiles? They are hard-coded at 25% and 75%.
> >>
> >> Q1 and Q3 at 2%/98% give a way to ignore outliers.
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: OnlineSummarizer mods

Posted by Lance Norskog <go...@gmail.com>.
This seems to be the same thing, from '95. And simple enough that even
I can follow it.

http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf

On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <te...@gmail.com> wrote:
> By definition, quaRtiles are quartiles.
>
> You can adapt the code to generate other quaNtiles by adjusting the
> parameters that adjust the estimates up and down.  Right now, I use the
> inter-quartile range to estimate the probability density.  You would need to
> do something else.
>
> The original paper has some help in this regard.  See the javadoc for the
> link.
>
> On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> How would one change OnlineSummarizer to allow setting the 1 and 3
>> quartiles? They are hard-coded at 25% and 75%.
>>
>> Q1 and Q3 at 2%/98% give a way to ignore outliers.
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: OnlineSummarizer mods

Posted by Ted Dunning <te...@gmail.com>.
By definition, quaRtiles are quartiles.

You can adapt the code to generate other quaNtiles by adjusting the
parameters that adjust the estimates up and down.  Right now, I use the
inter-quartile range to estimate the probability density.  You would need to
do something else.
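
For example, a hypothetical sketch (not the OnlineSummarizer API): if the
summarizer tracked qLow and qHigh instead of the fixed quartiles, the density
used to scale the up/down adjustments could come from the spread between the
two tracked estimates rather than from the inter-quartile range.

    public class DensityScale {
      // average density between the two tracked quantiles:
      // probability mass between them divided by the width between them
      public static double estimate(double qLow, double qHigh,
                                    double estimateLow, double estimateHigh) {
        double spread = Math.max(estimateHigh - estimateLow, 1e-9); // avoid divide by zero
        return (qHigh - qLow) / spread;
      }
    }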

The original paper has some help in this regard.  See the javadoc for the
link.

On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <go...@gmail.com> wrote:

> How would one change OnlineSummarizer to allow setting the 1 and 3
> quartiles? They are hard-coded at 25% and 75%.
>
> Q1 and Q3 at 2%/98% give a way to ignore outliers.
>
> --
> Lance Norskog
> goksron@gmail.com
>