Posted to dev@druid.apache.org by Charles Allen <cr...@apache.org> on 2018/07/17 17:41:23 UTC

TopN folding and result ordering (and maybe group by)

I brought this up in the Dev Sync but thought I would write up a couple of
findings here.

We have some TopN queries that return large results, and we have been
looking at optimizations in the TopN (or GroupBy) query path in order to
accommodate these larger result sets returning from many hundreds of nodes.

Looking at the TopN binary apply function
io.druid.query.topn.TopNBinaryFn#apply
<https://github.com/apache/incubator-druid/blob/druid-0.12.1/processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java#L75-L135>
which does the result folding, there is a basic hash join of the two
results in order to do the fold. This ends up doing a lot of hash map
work creating maps and adding entries. I tried some really basic
optimizations to reduce the number of hash map operations in this function,
but they did not result in any measurable improvement in a real environment.
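
To make the shape of that fold concrete, here is a very rough sketch
(illustrative names and simplified types, not the actual TopNBinaryFn
code): one side gets poured into a map keyed on the dimension value, and
the other side is walked, combining aggregators whenever a key shows up on
both sides.

    import io.druid.query.aggregation.AggregatorFactory;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Rough sketch only, not the real Druid code paths.
    class HashFoldSketch
    {
      static Map<Object, Map<String, Object>> fold(
          String dimensionName,
          List<AggregatorFactory> aggregatorSpecs,
          List<Map<String, Object>> lhsRows,
          List<Map<String, Object>> rhsRows
      )
      {
        final Map<Object, Map<String, Object>> merged = new HashMap<>();
        // Seed the map with one side of the join.
        for (Map<String, Object> row : lhsRows) {
          merged.put(row.get(dimensionName), new HashMap<>(row));
        }
        // Walk the other side, combining aggregators on key collisions.
        for (Map<String, Object> row : rhsRows) {
          final Map<String, Object> existing = merged.get(row.get(dimensionName));
          if (existing == null) {
            merged.put(row.get(dimensionName), new HashMap<>(row));
          } else {
            for (AggregatorFactory factory : aggregatorSpecs) {
              final String name = factory.getName();
              existing.put(name, factory.combine(existing.get(name), row.get(name)));
            }
          }
        }
        return merged;
      }
    }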

You can see some cpu time flame graphs in
https://github.com/apache/incubator-druid/pull/5913 . Ideally the work
would be spent in the aggregator combining functions rather than in a
whole bunch of hash map state manipulation.

One potential improvement would be to move from a hash join to a merge
join. Such a change would require altering the ordering of the items so
that they can be merge joined; the current ordering is based on the
aggregation specification order. This change should allow iterating through
the topn result values only once, with a simple way to insert values that
appear in only one stream into the merged topn result value.
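
As a sketch of what I mean (again with illustrative names, and assuming
both sides arrive already sorted ascending on the dimension value), the
merge style fold would be a single two-pointer pass over both streams:

    import io.druid.query.aggregation.AggregatorFactory;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class MergeFoldSketch
    {
      // Assumes lhsRows and rhsRows are each sorted ascending on the
      // dimension value, so one pass can combine matching keys.
      @SuppressWarnings("unchecked")
      static List<Map<String, Object>> fold(
          String dimensionName,
          List<AggregatorFactory> aggregatorSpecs,
          List<Map<String, Object>> lhsRows,
          List<Map<String, Object>> rhsRows
      )
      {
        final List<Map<String, Object>> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < lhsRows.size() && j < rhsRows.size()) {
          final Map<String, Object> lhs = lhsRows.get(i);
          final Map<String, Object> rhs = rhsRows.get(j);
          final int cmp = ((Comparable<Object>) lhs.get(dimensionName))
              .compareTo(rhs.get(dimensionName));
          if (cmp < 0) {
            // Key only present on the left: emit it as-is.
            out.add(lhs);
            i++;
          } else if (cmp > 0) {
            // Key only present on the right: emit it as-is.
            out.add(rhs);
            j++;
          } else {
            // Key present on both sides: combine the aggregators.
            final Map<String, Object> combined = new HashMap<>(lhs);
            for (AggregatorFactory factory : aggregatorSpecs) {
              final String name = factory.getName();
              combined.put(name, factory.combine(lhs.get(name), rhs.get(name)));
            }
            out.add(combined);
            i++;
            j++;
          }
        }
        // Drain whichever side still has rows left.
        while (i < lhsRows.size()) {
          out.add(lhsRows.get(i++));
        }
        while (j < rhsRows.size()) {
          out.add(rhsRows.get(j++));
        }
        return out;
      }
    }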

This means the query path would sort the results at query time, and
reorder the results to retain "specification order" only at the last stage
out (during a "finalize" kind of step).
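
Concretely, assuming a numeric topn metric spec, that finalize-style step
could be little more than re-sorting the dimension-ordered intermediate
rows by the metric and cutting to the threshold. A hypothetical sketch:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    class FinalizeOrderingSketch
    {
      // Takes intermediate rows that were merged in dimension order and
      // restores the order the user asked for (here: descending by the topN
      // metric), keeping only the top `threshold` rows.
      static List<Map<String, Object>> toSpecificationOrder(
          String metricName,
          int threshold,
          List<Map<String, Object>> dimensionOrderedRows
      )
      {
        final List<Map<String, Object>> sorted = new ArrayList<>(dimensionOrderedRows);
        sorted.sort(
            Comparator.comparingDouble(
                (Map<String, Object> row) -> ((Number) row.get(metricName)).doubleValue()
            ).reversed()
        );
        return sorted.subList(0, Math.min(threshold, sorted.size()));
      }
    }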

In such a scenario, a potential future optimization would be to allow
results to be streamed back per topn result value. I *think* the current
implementation only considers the timestamp level, meaning if you do an ALL
granularity query, there is only one "chunk" of results that can be
streamed back. I haven't been digging deeply into this aspect though.

Such an optimization should also be applicable to group by queries, so I
don't know if the folks working heavily on the group by queries have
considered this or alternatives.

My question is as follows:

Are there any issues people see with either using the Finalize flag of a
query to determine the sort order, or adding a new query context key to
determine whether the sort order should be specification order (default) or
lexicographic order (internal override)?
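
For the second option, I am imagining something like the following, where
the context key name is made up purely to make the idea concrete:

    import java.util.Map;

    class ResultOrderingContextSketch
    {
      // Hypothetical context key; not an existing Druid flag.
      static final String CTX_RESULT_ORDERING = "resultOrdering";

      // The merge path could consult the query context and default to the
      // existing specification-order behavior when the key is absent.
      static boolean mergeLexicographically(Map<String, Object> queryContext)
      {
        if (queryContext == null) {
          return false;
        }
        return "lexicographic".equals(
            queryContext.getOrDefault(CTX_RESULT_ORDERING, "specification")
        );
      }
    }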


Thanks,
Charles Allen