Posted to dev@datasketches.apache.org by leerho <le...@gmail.com> on 2019/10/19 17:31:43 UTC

DDSketch

Science team,

Would you please take a look at this sketch? It appears to address the
"relative error" for quantiles issue we have been discussing recently.

https://blog.acolyer.org/2019/09/06/ddsketch/

Thanks,

Lee.

Re: DDSketch

Posted by Justin Thaler <Ju...@georgetown.edu>.
> But it is easy to come up with contrived examples where this notion of
> error makes no sense. E.g., suppose all of the data items are numbers
> in some small range, but the numbers themselves are huge (for
> concreteness, say [10^{20}, 10^{20}+i] for some small value of i); then their
> definition permits additive error that is vastly bigger than the actual
> spread of the data (in my contrived example, the error can grow like
> 10^{20}, even though error at most i can be achieved just by storing the
> max and min).

To clarify the point I'm making above:

I am not saying that, on this contrived example, their algorithm would
actually return an item \tilde{x}_q outside of the data range [10^{20},
10^{20}+i] (it wouldn't). But their problem definition would allow an
algorithm to do this, and still be called a "small-relative-error"
algorithm according to the definition.

So I am highlighting a setting where the problem definition itself seems
inappropriate, rather than a setting where the algorithm actually would
behave badly.

(As the paper itself acknowledges, one can easily come up with examples
where the algorithm behaves badly, at least if one tries to keep the space
constant...).
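
To put rough numbers on this, here is a tiny Python back-of-the-envelope
(alpha = 0.01 and i = 1000 are just illustrative choices, not values from the
paper):

    alpha = 0.01                  # illustrative relative-accuracy parameter
    x_q = 10**20                  # true q'th-percentile item in [10^{20}, 10^{20}+i]
    i = 1000                      # spread of the data
    allowed_error = alpha * x_q   # additive error their definition permits
    print(allowed_error)          # 1e+18
    print(allowed_error / i)      # 1e+15, i.e., 10^15 times the spread of the data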

On Tue, Oct 22, 2019 at 8:56 AM Justin Thaler <ju...@georgetown.edu>
wrote:

> This paper does not solve the relative error problem we've discussed
> previously. Which isn't to say it's not interesting, but it's not the same
> problem (it's unfortunate that the paper picked the same name that we've
> been using, when the problems are actually very different).
>
> Here's the situation.
>
> When we've said "relative-error quantiles", we've meant what the paper
> calls "relative-error rank-error". That means, roughly, that if someone
> specifies a query percentile q, the algorithm should return an item whose
> true percentile is (1 \pm epsilon)*q for some small epsilon.
>
> When the paper says "relative-error quantiles", what it means is that if
> someone specifies a query percentile q, and the true
> q'th-percentile item in the dataset is x_q, then the algorithm should
> return an item \tilde{x}_q such that |x_q - \tilde{x}_q| <= epsilon * x_q.
>
> So in their problem, error is measured with respect to the _magnitude of
> the data items_.
>
> This can certainly make sense for numerical data falling in some
> understood range, e.g., they focus on latency of web requests, which
> apparently are typically between milliseconds and minutes.
>
> But it is easy to come up with contrived examples where this notion of
> error makes no sense. E.g., suppose all of the data items are numbers
> in some small range, but the numbers themselves are huge (for
> concreteness, say [10^{20}, 10^{20}+i] for some small value of i); then their
> definition permits additive error that is vastly bigger than the actual
> spread of the data (in my contrived example, the error can grow like
> 10^{20}, even though error at most i can be achieved just by storing the
> max and min).
>
> In our problem, the actual data items are never referred to, only the
> ranks---in fact, our problem makes sense even if data is not numerical, so
> long as it is totally ordered.
>
> Their algorithm is very simple, and it does have provable worst-case error
> guarantees if one lets the space cost grow logarithmically with the
> universe size. In the paper, they seem to want the space cost to be
> constant independent of the universe size, and they develop some pruning
> heuristics to deal with this, but these heuristics don't (and can't) come
> with worst-case guarantees. All of their experiments and analysis of these
> settings are for highly "well-behaved" data (e.g., iid from very
> well-understood distributions).
>
> They seem to be the first work (other than HDRHistogram, which I haven't
> looked at but which apparently doesn't come with provable error guarantees)
> to study this notion of error for quantiles. I am guessing this is at least
> in part because solving the problem they define is not super difficult from
> an algorithmic perspective.
>
> But, to be clear, if their notion of error actually makes sense for an
> application, and one is happy letting the space grow logarithmically with
> the universe size, then simplicity is a virtue, not a flaw.
>
> BTW I knew one of the authors when I was starting grad school. We've lost
> touch, but I'm happy to see a new paper by him!
>
> On Sun, Oct 20, 2019 at 2:42 PM Jon Malkin <jo...@gmail.com> wrote:
>
>> A cursory glance suggests that it's O(N) in size if you want guarantees
>> against an adversary. The bucket-collapsing method reduces accuracy in
>> collapsed buckets. If so, the easiest attack in a distributed setting would
>> be to manipulate the data splits so that significant blocks of data get
>> collapsed together when merging.
>>
>> I'll see if I have a chance to read things more carefully this coming
>> week.
>>
>>   jon
>>
>> On Sat, Oct 19, 2019, 10:32 AM leerho <le...@gmail.com> wrote:
>>
>> > Science team,
>> >
>> > Would you please take a look at this sketch? It appears to address the
>> > "relative error" for quantiles issue we have been discussing recently.
>> >
>> > https://blog.acolyer.org/2019/09/06/ddsketch/
>> >
>> > Thanks,
>> >
>> > Lee.
>> >
>>
>

Re: DDSketch

Posted by Justin Thaler <Ju...@georgetown.edu>.
This paper does not solve the relative error problem we've discussed
previously. Which isn't to say it's not interesting, but it's not the same
problem (it's unfortunate that the paper picked the same name that we've
been using, when the problems are actually very different).

Here's the situation.

When we've said "relative-error quantiles", we've meant what the paper
calls "relative-error rank-error". That means, roughly, that if someone
specifies a query percentile q, the algorithm should return an item whose
true percentile is (1 \pm epsilon)*q for some small epsilon.

When the paper says "relative-error quantiles", what it means is that if
someone specifies a query percentile q, and the true
q'th-percentile item in the dataset is x_q, then the algorithm should
return an item \tilde{x}_q such that |x_q - \tilde{x}_q| <= epsilon * x_q.

So in their problem, error is measured with respect to the _magnitude of
the data items_.

This can certainly make sense for numerical data falling in some understood
range, e.g., they focus on latency of web requests, which apparently are
typically between milliseconds and minutes.
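
To make the contrast concrete, here is a small Python sketch of the two error
checks (the function names and the toy data are mine, just for illustration):

    import math

    def rank_error_ok(data, q, answer, eps):
        # our notion: the answer's true rank must be within (1 +/- eps) * q
        true_rank = sum(1 for x in data if x <= answer) / len(data)
        return abs(true_rank - q) <= eps * q

    def value_error_ok(data, q, answer, eps):
        # the paper's notion: |x_q - answer| <= eps * x_q
        x_q = sorted(data)[min(len(data) - 1, math.ceil(q * len(data)) - 1)]
        return abs(x_q - answer) <= eps * x_q

    # huge, tightly clustered values (the contrived example I describe below)
    data = [10**20 + k for k in range(1000)]
    answer = 10**20 + 10**17       # an answer far outside the data's range
    print(value_error_ok(data, 0.5, answer, 0.01))   # True: fine by their definition
    print(rank_error_ok(data, 0.5, answer, 0.01))    # False: its rank is wildly wrong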

But it is easy to come up with contrived examples where this notion of
error makes no sense. E.g., suppose all of the data items are numbers
in some small range, but the numbers themselves are huge (for
concreteness, say [10^{20}, 10^{20}+i] for some small value of i); then their
definition permits additive error that is vastly bigger than the actual
spread of the data (in my contrived example, the error can grow like
10^{20}, even though error at most i can be achieved just by storing the
max and min).

In our problem, the actual data items are never referred to, only the
ranks---in fact, our problem makes sense even if data is not numerical, so
long as it is totally ordered.

Their algorithm is very simple, and it does have provable worst-case error
guarantees if one lets the space cost grow logarithmically with the
universe size. In the paper, they seem to want the space cost to be
constant independent of the universe size, and they develop some pruning
heuristics to deal with this, but these heuristics don't (and can't) come
with worst-case guarantees. All of their experiments and analysis of these
settings are for highly "well-behaved" data (e.g., iid from very
well-understood distributions).
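
Concretely, here is my reading of their bucket mapping as a Python paraphrase
(not their code; alpha = 0.01 and the helper names are my own choices). It also
shows why the bucket count grows with the log of the universe size:

    import math

    alpha = 0.01                          # example accuracy parameter
    gamma = (1 + alpha) / (1 - alpha)

    def bucket_index(x):
        # bucket i covers the interval (gamma^(i-1), gamma^i]
        return math.ceil(math.log(x, gamma))

    def bucket_estimate(i):
        # value reported for bucket i; within relative error alpha of anything in it
        return 2 * gamma ** i / (gamma + 1)

    # per-item guarantee: any positive x is recovered to within relative error alpha
    x = 12345.678
    print(abs(bucket_estimate(bucket_index(x)) - x) / x <= alpha)   # True

    # but covering a universe [1, U] needs roughly ln(U) / (2 * alpha) buckets
    for U in (10**3, 10**6, 10**9):
        print(U, bucket_index(U))         # about 346, 691, and 1037 buckets here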

They seem to be the first work (other than HDRHistogram, which I haven't
looked at but which apparently doesn't come with provable error guarantees)
to study this notion of error for quantiles. I am guessing this is at least
in part because solving the problem they define is not super difficult from
an algorithmic perspective.

But, to be clear, if their notion of error actually makes sense for an
application, and one is happy letting the space grow logarithmically with
the universe size, then simplicity is a virtue, not a flaw.

BTW I knew one of the authors when I was starting grad school. We've lost
touch, but I'm happy to see a new paper by him!

On Sun, Oct 20, 2019 at 2:42 PM Jon Malkin <jo...@gmail.com> wrote:

> A cursory glance suggests that it's O(N) in size if you want guarantees
> against an adversary. The bucket-collapsing method reduces accuracy in
> collapsed buckets. If so, the easiest attack in a distributed setting would
> be to manipulate the data splits so that significant blocks of data get
> collapsed together when merging.
>
> I'll see if I have a chance to read things more carefully this coming week.
>
>   jon
>
> On Sat, Oct 19, 2019, 10:32 AM leerho <le...@gmail.com> wrote:
>
> > Science team,
> >
> > Would you please take a look at this sketch? It appears to address the
> > "relative error" for quantiles issue we have been discussing recently.
> >
> > https://blog.acolyer.org/2019/09/06/ddsketch/
> >
> > Thanks,
> >
> > Lee.
> >
>

Re: DDSketch

Posted by Jon Malkin <jo...@gmail.com>.
A cursory glance suggests that it's O(N) in size if you want guarantees
against an adversary. The bucket-collapsing method reduces accuracy in
collapsed buckets. If so, the easiest attack in a distributed setting would
be to manipulate the data splits so that significant blocks of data get
collapsed together when merging.
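
To illustrate the accuracy loss I mean, here is a toy Python paraphrase of a
lowest-bucket collapse (my sketch of the idea, not their code; alpha and the
artificially small bucket cap are just illustrative):

    import math

    alpha = 0.01
    gamma = (1 + alpha) / (1 - alpha)
    max_buckets = 32                       # artificially small cap, for illustration

    def bucket_index(x):
        return math.ceil(math.log(x, gamma))

    def bucket_estimate(i):
        return 2 * gamma ** i / (gamma + 1)

    def add(counts, x):
        i = bucket_index(x)
        counts[i] = counts.get(i, 0) + 1
        while len(counts) > max_buckets:   # collapse: fold lowest bucket into next-lowest
            lo, nxt = sorted(counts)[:2]
            counts[nxt] += counts.pop(lo)

    counts = {}
    for x in [0.001, 0.01, 0.1] + list(range(1, 1000)):
        add(counts, x)

    # 0.001 alone would have been answered to within 1%, but the later, larger
    # values push its bucket out; a query for it now lands in the lowest
    # surviving bucket, whose representative value is off by orders of magnitude:
    print(bucket_estimate(min(counts)), "vs true value 0.001")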

I'll see if I have a chance to read things more carefully this coming week.

  jon

On Sat, Oct 19, 2019, 10:32 AM leerho <le...@gmail.com> wrote:

> Science team,
>
> Would you please take a look at this sketch? It appears to address the
> "relative error" for quantiles issue we have been discussing recently.
>
> https://blog.acolyer.org/2019/09/06/ddsketch/
>
> Thanks,
>
> Lee.
>