Posted to dev@druid.apache.org by Alexander Saydakov <sa...@verizonmedia.com.INVALID> on 2020/11/06 18:19:31 UTC

Re: [E] quantilesDoubleSketches min/max postaggregators

quantile(0) = min value
quantile(1) = max value
You can use the sketch-to-quantiles post-aggregator to get the min, the max,
or any number of other quantiles.
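
As a minimal sketch of this suggestion, the Python below builds the post-aggregation entry of a native Druid query that requests fractions 0 and 1. The `quantilesDoublesSketchToQuantiles` type with its `field` and `fractions` properties belongs to the DataSketches extension; the names `value_sketch` and `value_min_max` are hypothetical, and `value_sketch` is assumed to be the output of a `quantilesDoublesSketch` aggregator in the same query.

```python
import json


def min_max_post_aggregator(sketch_name, output_name):
    """Build a sketch-to-quantiles post-aggregator that requests min and max.

    sketch_name is assumed to be the output name of a quantilesDoublesSketch
    aggregator defined in the same query (hypothetical name in this example).
    """
    return {
        "type": "quantilesDoublesSketchToQuantiles",
        "name": output_name,
        "field": {"type": "fieldAccess", "fieldName": sketch_name},
        # Fraction 0 is the minimum, fraction 1 is the maximum.
        "fractions": [0, 1],
    }


post_agg = min_max_post_aggregator("value_sketch", "value_min_max")
print(json.dumps(post_agg, indent=2))
```

The result for `value_min_max` should then be a two-element array holding the min and the max, in the order of the requested fractions.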

Regarding your observation that sketch-to-histogram(num bins) does not give
information about the computed split points: that is valuable feedback.
Perhaps we could consider returning the split points somehow, but I am not
quite sure what the return type should be. We would need to return two arrays:
the probability mass in each bin, as we do currently (one array of doubles),
and the split points computed from the min, the max, and the given number of
bins. This post-aggregator can also accept split points as input; should we
return them in that case as well, for consistency?
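
Until the split points are returned by the server, a client could recompute them from the min and the max. The sketch below assumes equal-width bins between min and max; that assumption follows the description in this thread and is not copied from the Druid source. The function name is hypothetical.

```python
def equally_spaced_split_points(min_value, max_value, num_bins):
    """Return the num_bins - 1 interior split points between min and max.

    Assumes equal-width bins, which is an assumption of this example.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    width = (max_value - min_value) / num_bins
    return [min_value + width * i for i in range(1, num_bins)]


points = equally_spaced_split_points(0.0, 10.0, 5)
print(points)
```

With a single bin there are no interior split points, so the function returns an empty list.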


On Fri, Nov 6, 2020 at 3:30 AM Jérémie Girault <je...@hubvisor.io> wrote:

> Hello everyone,
>
> I previously asked a question on the ASF Slack, and someone asked me to
> send it to the dev list. I just subscribed to the list to forward the
> message I sent:
>
> I was playing with the DataSketches Quantiles Sketch module in Druid,
> trying to retrieve some histograms using quantilesDoublesSketchToHistogram.
> However, when I used numBins, I could not label the value of each bin when
> plotting.
> I can't seem to find any postAggregator that allows me to get min/max
> values in order to recompute bins on the client side.
> Should I use min/max aggregators when ingesting, and query them alongside
> my histogram as a workaround? That seems like a lot of space/time for
> something that looks "free" to retrieve from Quantile Sketches.
> Wouldn't it be useful to have min/max postAggregators for the
> quantilesDoubleSketches aggregator and/or labels for the histogram bins?
> I located this chunk of code:
> https://github.com/apache/druid/blob/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketchToHistogramPostAggregator.java
> It does not look too complicated for me to contribute, but I am out of
> practice with Java development these days, and it would take me a while to
> get it right.
> Would such features be considered if requested/submitted?
>
> Thank you,
>
> --
>
> Jérémie Girault
>

Re: [E] quantilesDoubleSketches min/max postaggregators

Posted by Alexander Saydakov <sa...@verizonmedia.com.INVALID>.
Keep in mind that the two arrays have different sizes. If you start from N
split points, the result is N+1 bins. If you ask for N bins, the internal
logic produces N-1 split points. And you want these split points to be
part of the returned result.
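
To illustrate the size mismatch, here is a small sketch that pairs N bin values with N-1 split points by adding the min and the max as the outer edges. The function and field names are hypothetical, and the min and max are assumed to come from a separate quantile(0)/quantile(1) request.

```python
def label_bins(min_value, max_value, split_points, bin_values):
    """Pair each bin value with its (lower, upper) edges.

    N split points delimit N + 1 bins, so bin_values must have exactly
    one more element than split_points.
    """
    if len(bin_values) != len(split_points) + 1:
        raise ValueError("expected len(bin_values) == len(split_points) + 1")
    # The min and the max close the first and last bins.
    edges = [min_value, *split_points, max_value]
    return [
        {"lower": edges[i], "upper": edges[i + 1], "value": bin_values[i]}
        for i in range(len(bin_values))
    ]


bins = label_bins(0.0, 10.0, [2.5, 5.0, 7.5], [4.0, 1.0, 3.0, 2.0])
for b in bins:
    print(b)
```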

On Tue, Nov 10, 2020 at 10:23 AM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> I am not sure how important the compatibility with the current version is.
> I am afraid I don't have time to work on this at the moment.
> I would like to see some discussion about the best way forward.
> Would you be willing to contribute this change once the community agrees
> on the output format?

Re: [E] quantilesDoubleSketches min/max postaggregators

Posted by Alexander Saydakov <sa...@verizonmedia.com.INVALID>.
I am not sure how important the compatibility with the current version is.
I am afraid I don't have time to work on this at the moment.
I would like to see some discussion about the best way forward.
Would you be willing to contribute this change once the community agrees on
the output format?

On Mon, Nov 9, 2020 at 3:36 AM Jérémie Girault <je...@hubvisor.io> wrote:

> Hello,
>
> This info about q0 and q1 is good to know. I will use it, thank you!
>
> As a user, I would be glad to get the split points alongside the histogram
> values so that I can plot them.
> For consistency, it would indeed be just as useful to get them back whether
> the request uses `numBins` or `splitPoints`: when I display a histogram, I
> don't want two different code paths to handle the query result.
>
> I can imagine several output formats, each with pros and cons:
> - list of tuples: `[ [ <bin value>, <bin count> ], ... ]`
>         pro: simple
>         con: the format may be confusing without docs; breaks the current output format (can be solved by adding a flag controlling output)
> - list of objects: `[ { "value": <value>, "count": <count> }, ... ]`
>         pro: simple, timeseries-like, probably the easiest to display
>         con: breaks the current output format (can be solved by adding a flag controlling output)
> - bins postAggregator + histogram values postAggregator: `{ bins: [ ... ], values: [ ... ] }`
>         pro: compatible with the current format; the feature is available on demand
>         con: must zip the arrays on the client side
>
> What do you think?
>
> --
>
> Jérémie Girault

Re: [E] quantilesDoubleSketches min/max postaggregators

Posted by Jérémie Girault <je...@hubvisor.io>.
Hello,

This info about q0 and q1 is good to know. I will use it, thank you!

As a user, I would be glad to get the split points alongside the histogram values so that I can plot them.
For consistency, it would indeed be just as useful to get them back whether the request uses `numBins` or `splitPoints`: when I display a histogram, I don't want two different code paths to handle the query result.

I can imagine several output formats, each with pros and cons:
- list of tuples: `[ [ <bin value>, <bin count> ], ... ]`
	pro: simple
	con: the format may be confusing without docs; breaks the current output format (can be solved by adding a flag controlling output)
- list of objects: `[ { "value": <value>, "count": <count> }, ... ]`
	pro: simple, timeseries-like, probably the easiest to display
	con: breaks the current output format (can be solved by adding a flag controlling output)
- bins postAggregator + histogram values postAggregator: `{ bins: [ ... ], values: [ ... ] }`
	pro: compatible with the current format; the feature is available on demand
	con: must zip the arrays on the client side

What do you think?

--

Jérémie Girault
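
For comparison, the three formats listed above carry the same information, and converting between them on the client is short. The sketch below is only an illustration: it assumes the combined result has one `bins` entry per `values` entry, which is an assumption of this example since the thread has not settled the format, and the function names are hypothetical.

```python
def zip_histogram(result):
    """Turn {"bins": [...], "values": [...]} into a list of objects."""
    bins, values = result["bins"], result["values"]
    if len(bins) != len(values):
        raise ValueError("bins and values must have the same length")
    return [{"value": b, "count": v} for b, v in zip(bins, values)]


def to_tuples(objects):
    """Turn the list-of-objects format into the list-of-tuples format."""
    return [[o["value"], o["count"]] for o in objects]


combined = {"bins": [0.0, 2.5, 5.0, 7.5], "values": [4.0, 1.0, 3.0, 2.0]}
objects = zip_histogram(combined)
tuples = to_tuples(objects)
print(objects)
print(tuples)
```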