Posted to dev@druid.apache.org by Samarth Jain <sa...@gmail.com> on 2019/06/26 22:30:02 UTC

Returning only post aggregated results

Hi,

I recently contributed TDigest based sketch aggregators in Druid. It also
included a post aggregator that lets you generate quantiles from the
aggregated sketches.

Example query:

{
        "queryType": "groupBy",
        "dataSource": "test_datasource",
        "granularity": "ALL",
        "dimensions": [],
        "aggregations": [{
                "type": "mergeTDigestSketch",
                "name": "merged_sketch",
                "fieldName": "ingested_sketch",
                "compression": 200
        }],
        "postAggregations": [{
                "type": "quantilesFromTDigestSketch",
                "name": "quantiles",
                "fractions": [0, 0.5, 1],
                "field": {
                        "type": "fieldAccess",
                        "fieldName": "merged_sketch"
                }
        }],
        "intervals": ["2016-01-01T00:00:00.000Z/2016-01-31T00:00:00.000Z"]
}

The one limitation I have been running into is that the above query returns
both merged_sketch that was aggregated and the quantiles array that was
generated from applying post aggregation on merged_sketch. What I would
rather want in this case is for the query to just return the quantiles
array.

So instead of

"version": "v1",
        "timestamp": "2019-06-25T00:00:00.000Z",
        "event": {
             "quantiles": [
                0,
                162569.21411280808,
                5814934
            ],
            "merged_sketch": "AAAABBAXAS"
          }

I would prefer this:
"version": "v1",
        "timestamp": "2019-06-25T00:00:00.000Z",
        "event": {
             "quantiles": [
                0,
                162569.21411280808,
                5814934
            ]
          }

Is there a way to achieve this today? I tried changing post aggregation
field access from

"field": {
                        "type": "fieldAccess",
                        "fieldName": "merged_sketch"
                }

to

"field": {
                        "type": "finalizingFieldAccess",
                        "fieldName": "merged_sketch"
                }

but that didn't help either.
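
Until native queries can hide intermediate aggregations, one client-side option is to drop the aggregator outputs after the response arrives. The following is only a minimal sketch under that assumption: the helper name drop_aggregator_fields is hypothetical, and it assumes groupBy rows shaped like the "event" objects shown above.

```python
def drop_aggregator_fields(query, rows):
    """Return copies of groupBy rows with raw aggregator outputs removed.

    Hypothetical client-side helper: every name listed under the query's
    "aggregations" is treated as internal and stripped from each event,
    so only dimensions and post-aggregation results remain.
    """
    agg_names = {agg["name"] for agg in query.get("aggregations", [])}
    trimmed = []
    for row in rows:
        new_row = dict(row)  # shallow copy so the caller's row is untouched
        new_row["event"] = {
            key: value
            for key, value in row["event"].items()
            if key not in agg_names
        }
        trimmed.append(new_row)
    return trimmed


# Trimmed-down versions of the query and response from this message.
query = {
    "aggregations": [{"type": "mergeTDigestSketch", "name": "merged_sketch"}],
    "postAggregations": [{"type": "quantilesFromTDigestSketch", "name": "quantiles"}],
}
rows = [{
    "version": "v1",
    "timestamp": "2019-06-25T00:00:00.000Z",
    "event": {
        "quantiles": [0, 162569.21411280808, 5814934],
        "merged_sketch": "AAAABBAXAS",
    },
}]

print(drop_aggregator_fields(query, rows))
```

This only hides the sketch from downstream consumers; the serialized sketch is still computed and sent over the wire.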

Thanks,
Samarth

Re: Returning only post aggregated results

Posted by Samarth Jain <sa...@gmail.com>.
Hi Clint,

Sorry for the delay in reply. This fell off my radar for a couple of weeks.

For my use case I would like the post aggregator to be able to return an
array of doubles. I don't see post aggregators in datasketches that do
that.

An example query is

SELECT
APPROX_QUANTILE_TDIGEST(SKETCH_STRING, [0.1, 0.2, 0.3])
FROM FOO
GROUP BY DIM1

which would then return an array of doubles containing values for each of
the quantiles.

The less convenient workaround is for the user to give multiple quantile
expressions:
SELECT
APPROX_QUANTILE_TDIGEST(SKETCH_STRING, 0.1) ,
APPROX_QUANTILE_TDIGEST(SKETCH_STRING, 0.2) ,
APPROX_QUANTILE_TDIGEST(SKETCH_STRING, 0.3)
FROM FOO
GROUP BY DIM1

I am also not too sure about the performance impact of the workaround.
My guess/hope is that the query engine is smart enough to merge the
sketches only once, with the post aggregators working on the merged sketch
(instead of each post aggregator expression causing the merge to happen
multiple times).
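
If the workaround is used, the repeated expressions can at least be generated rather than typed by hand. This is a minimal sketch, not part of Druid: the helper build_quantile_query is hypothetical, and APPROX_QUANTILE_TDIGEST is the function name used in this message, which may differ from what eventually ships. It says nothing about whether the engine merges the sketch once or several times.

```python
def build_quantile_query(table, sketch_column, fractions, group_by):
    """Build one SQL statement with a quantile expression per fraction.

    Identifiers are interpolated as-is, so they must come from trusted
    code, not from user input.
    """
    select_items = []
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction out of range: {fraction}")
        select_items.append(
            f"APPROX_QUANTILE_TDIGEST({sketch_column}, {fraction})"
        )
    return (
        "SELECT\n"
        + ",\n".join(select_items)
        + f"\nFROM {table}\nGROUP BY {group_by}"
    )


sql = build_quantile_query("FOO", "SKETCH_STRING", [0.1, 0.2, 0.3], "DIM1")
print(sql)
```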

Thanks,
Samarth






On Fri, Jun 28, 2019 at 3:26 AM Clint Wylie <cw...@apache.org> wrote:

> > Besides a lot of the use cases have multi valued dimensions which SQL
> > standard doesn't support in general.
>
> I'd be happy to try and make the multi value/array functionality work for
> whatever your use case is, so if you have any feedback to give, either in
> this thread, or on the proposal
> https://github.com/apache/incubator-druid/issues/7525, that would be
> great.
>
> > On the note of SQL support, do you have know of any examples in Druid SQL
> > where a sql aggregation function returns an array of doubles? I looked at
> > DoubleSketchSqlAggregator but it seems to be returning a single double
> > value.
>
> If an existing agg/postagg combination (e.g.
> https://druid.apache.org/docs/latest/development/extensions-core/datasketches-tuple.html)
> doesn't provide what you need, then depending what it is you do need,
> something might be possible with the stuff I've been working on, though
> probably in a bit of a convoluted way (if at all). Thus far I've only added
> what I would consider synthetic support for array types, since they can
> only exist within the expression system, or as the serialized output of an
> expression post aggregator. Internally in Druid there are still only
> single/multi value string columns and single value long/float/double
> columns, so the rest of the query processing system is cannot operate
> directly on these array types. So, expression virtual columns which produce
> arrays must be coerced back into a native Druid type, which currently means
> probably either a string or a multi value string. If left as arrays, they
> automatically end up as a multi-value string. Using the 'array_to_string'
> function allows converting them into a single value, allowing grouping on
> the whole array in a sense. This joined string can then be fed into a
> 'string_to_array' expression post aggregator to split the strings back into
> the correct array type at the surface result level. Could you elaborate a
> bit more on what you are looking for?

Re: Returning only post aggregated results

Posted by Clint Wylie <cw...@apache.org>.
> Besides a lot of the use cases have multi valued dimensions which SQL
> standard doesn't support in general.

I'd be happy to try and make the multi value/array functionality work for
whatever your use case is, so if you have any feedback to give, either in
this thread, or on the proposal
https://github.com/apache/incubator-druid/issues/7525, that would be great.

> On the note of SQL support, do you know of any examples in Druid SQL
> where a sql aggregation function returns an array of doubles? I looked at
> DoubleSketchSqlAggregator but it seems to be returning a single double
> value.

If an existing agg/postagg combination (e.g.
https://druid.apache.org/docs/latest/development/extensions-core/datasketches-tuple.html)
doesn't provide what you need, then depending on what it is you do need,
something might be possible with the stuff I've been working on, though
probably in a bit of a convoluted way (if at all). Thus far I've only added
what I would consider synthetic support for array types, since they can
only exist within the expression system, or as the serialized output of an
expression post aggregator. Internally in Druid there are still only
single/multi value string columns and single value long/float/double
columns, so the rest of the query processing system cannot operate
directly on these array types. So, expression virtual columns which produce
arrays must be coerced back into a native Druid type, which currently means
probably either a string or a multi value string. If left as arrays, they
automatically end up as a multi-value string. Using the 'array_to_string'
function allows converting them into a single value, allowing grouping on
the whole array in a sense. This joined string can then be fed into a
'string_to_array' expression post aggregator to split the strings back into
the correct array type at the surface result level. Could you elaborate a
bit more on what you are looking for?
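
The join/group/split round trip described above can be pictured with a plain Python analogy. This is not Druid's expression system: the two functions below only borrow the names array_to_string and string_to_array, and the delimiter, the sample rows, and the float cast are all assumptions made for illustration.

```python
from collections import defaultdict

DELIMITER = ","  # assumed separator; any string absent from the values works


def array_to_string(values, delimiter=DELIMITER):
    """Collapse an array into one string so it can act as a grouping key."""
    return delimiter.join(str(value) for value in values)


def string_to_array(joined, delimiter=DELIMITER, cast=float):
    """Split the grouping key back into a typed array for the final result."""
    return [cast(part) for part in joined.split(delimiter)] if joined else []


# Made-up rows: group on the whole array and sum "n" per distinct array.
rows = [
    {"values": [0.1, 0.2], "n": 1},
    {"values": [0.1, 0.2], "n": 2},
    {"values": [0.3], "n": 5},
]

totals = defaultdict(int)
for row in rows:
    totals[array_to_string(row["values"])] += row["n"]

result = [{"values": string_to_array(key), "n": n} for key, n in totals.items()]
print(result)
```

The grouping step only ever sees a single string per row, which mirrors the constraint that the rest of query processing works on native column types; the array type reappears only at the result level.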

On Thu, Jun 27, 2019 at 10:44 PM Gian Merlino <gi...@apache.org> wrote:

> Hey Samarth,
>
> > I think it would be a good contribution to add a select only certain
> > fields
> > /projection feature for native queries. Not every team, for example at my
> > work, have adopted to use the Druid SQL. They just have been so used to
> > writing json queries ;). Besides a lot of the use cases have multi valued
> > dimensions which SQL standard doesn't support in general.
>
> The SQL standard doesn't have anything really like our mutli-valued
> dimensions, but, that doesn't stop us from trying to make them work in SQL
> anyway. Clint has been doing a bunch of work here recently. Check out some
> of these related PRs:
>
> - https://github.com/apache/incubator-druid/pull/7588
> - https://github.com/apache/incubator-druid/pull/7973
> - https://github.com/apache/incubator-druid/pull/7974
>
> > On the note of SQL support, do you have know of any examples in Druid SQL
> > where a sql aggregation function returns an array of doubles? I looked at
> > DoubleSketchSqlAggregator but it seems to be returning a single double
> > value.
>
> I don't have an example, and I'm not sure if we've quite made it to arrays
> of doubles yet, but Clint may be able to chime in with something
> intelligent there.

Re: Returning only post aggregated results

Posted by Gian Merlino <gi...@apache.org>.
Hey Samarth,

> I think it would be a good contribution to add a "select only certain fields"
> (projection) feature for native queries. Not every team, for example at my
> work, has adopted Druid SQL. They are just so used to writing JSON
> queries ;). Besides, a lot of the use cases have multi-valued dimensions,
> which the SQL standard doesn't support in general.

The SQL standard doesn't have anything really like our multi-valued
dimensions, but that doesn't stop us from trying to make them work in SQL
anyway. Clint has been doing a bunch of work here recently. Check out some
of these related PRs:

- https://github.com/apache/incubator-druid/pull/7588
- https://github.com/apache/incubator-druid/pull/7973
- https://github.com/apache/incubator-druid/pull/7974

> On the note of SQL support, do you know of any examples in Druid SQL
> where a sql aggregation function returns an array of doubles? I looked at
> DoubleSketchSqlAggregator but it seems to be returning a single double
> value.

I don't have an example, and I'm not sure if we've quite made it to arrays
of doubles yet, but Clint may be able to chime in with something
intelligent there.

On Thu, Jun 27, 2019 at 1:44 PM Samarth Jain <sa...@gmail.com> wrote:

> Thanks for the reply, Gian. I am working on adding SQL support for the
> t-digest module.
>
> I think it would be a good contribution to add a select only certain fields
> /projection feature for native queries. Not every team, for example at my
> work, have adopted to use the Druid SQL. They just have been so used to
> writing json queries ;). Besides a lot of the use cases have multi valued
> dimensions which SQL standard doesn't support in general.
>
> On the note of SQL support, do you have know of any examples in Druid SQL
> where a sql aggregation function returns an array of doubles? I looked at
> DoubleSketchSqlAggregator but it seems to be returning a single double
> value.

Re: Returning only post aggregated results

Posted by Samarth Jain <sa...@gmail.com>.
Thanks for the reply, Gian. I am working on adding SQL support for the
t-digest module.

I think it would be a good contribution to add a "select only certain fields"
(projection) feature for native queries. Not every team, for example at my
work, has adopted Druid SQL. They are just so used to writing JSON
queries ;). Besides, a lot of the use cases have multi-valued dimensions,
which the SQL standard doesn't support in general.

On the note of SQL support, do you know of any examples in Druid SQL
where a sql aggregation function returns an array of doubles? I looked at
DoubleSketchSqlAggregator but it seems to be returning a single double
value.


On Wed, Jun 26, 2019 at 10:26 PM Gian Merlino <gi...@apache.org> wrote:

> Hey Samarth,
>
> This kind of thing doable in Druid SQL, which will only return the stuff
> you SELECT. Native queries don't have a concept like that, so they always
> return everything, even if you intended certain things to be 'internal'
> computations and aren't interested in seeing the results directly. If it
> makes sense for you to use SQL I would suggest going that route. Otherwise
> it might be interesting to add a native query feature to select only
> certain fields.

Re: Returning only post aggregated results

Posted by Gian Merlino <gi...@apache.org>.
Hey Samarth,

This kind of thing is doable in Druid SQL, which will only return the stuff
you SELECT. Native queries don't have a concept like that, so they always
return everything, even if you intended certain things to be 'internal'
computations and aren't interested in seeing the results directly. If it
makes sense for you to use SQL I would suggest going that route. Otherwise
it might be interesting to add a native query feature to select only
certain fields.
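
For anyone taking the SQL route from a script, a request against Druid's SQL HTTP endpoint might look roughly like the sketch below. The broker address, the helper names, and the sample query are assumptions; only the POST to /druid/v2/sql/ with a JSON body holding a "query" string reflects the documented API. The response contains just the selected columns (here, row_count).

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082"  # assumed broker address; adjust as needed


def build_sql_request(sql, broker_url=BROKER_URL):
    """Prepare a POST to the Druid SQL endpoint without sending it."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{broker_url}/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, broker_url=BROKER_URL):
    """Send the query and decode the JSON rows (needs a reachable broker)."""
    with urllib.request.urlopen(build_sql_request(sql, broker_url)) as response:
        return json.loads(response.read())


# Only the selected column should come back, with no intermediate fields.
sql = 'SELECT COUNT(*) AS row_count FROM "test_datasource"'
request = build_sql_request(sql)
print(request.full_url)
```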
