You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/07 16:03:47 UTC

[GitHub] [incubator-druid] gianm commented on issue #7904: SQL planning slow with huge IN filters

gianm commented on issue #7904: SQL planning slow with huge IN filters
URL: https://github.com/apache/incubator-druid/issues/7904#issuecomment-509011366
 
 
   I did some looking into this and found not just one, but a few different root causes.
   
   1. Calcite has an O(N^2) OR simplification step in `RexSimplify.simplifyOr`: https://issues.apache.org/jira/browse/CALCITE-3178. This was the biggest contributor, and a hacky workaround of simply disabling this step cut the planning time down to ~8s.
   2. Druid's `Expressions.toSimpleLeafFilter` unconditionally attempts to check every "leaf" filter (non-boolean) to see if it's a time floor filter, for purposes of potentially converting `FLOOR(__time TO DAY) = X` into a range filter. This imposes a few seconds of overhead, which could be eliminated by only checking leaf filters that look like they _might_ be time floors (their operator matches `FLOOR`, `TIME_FLOOR`, or `CAST(x AS DATE)`).
   3. Druid's `CombineAndSimplifyBounds` builds a `TreeRangeSet<BoundValue>` out of every equality or bound leaf filter, to see if they can be simplified. Algorithmic complexity looks fine from a quick glance at the method, but there is overhead imposed by the fact that `BoundValue.compareTo`, which is called internally and often by the TreeRangeSet, always compares bounds as Strings. For numeric bounds involves a lot of wasteful converting of numbers and strings back and forth. This is only an issue when the bounds are numeric, but, in the case of my test query, they were (it was a long-typed column).
   4. Calcite's parser also seems to take a while (1–2s) to read through the SQL. Possibly #6974 would help with that, if it allows constructs like `col IN (?)` where `?` is an array.
   
   Profiling indicates that if all of the above were addressed, query planning time should come down to more acceptable levels.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org