Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/27 00:07:28 UTC

[GitHub] [druid] imply-cheddar commented on pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

imply-cheddar commented on PR #12277:
URL: https://github.com/apache/druid/pull/12277#issuecomment-1258805916

   On the IN filter benchmarks, I've definitely seen client-generated queries with large IN filters.  Looker generates them sometimes, other tools generate them too, and there are definitely applications that generate them.  They are programmatically generated, yes, but they also come from external sources.  I forget whether we've fixed this yet, but it is common enough that we have (or had?) a known issue around SQL parsing of large IN filters: the parser first expands them all into ORs before converting them back to IN, and our planning code for some reason likes to do `O(n^2)` passes over ORs.  Having a benchmark that also covers that case would be nice.
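   
   To make the quadratic concern concrete, here is a minimal sketch of how collapsing an OR chain back into an IN list can go quadratic versus linear.  This is not Druid's or Calcite's planner code; the class and method names are hypothetical, and each OR operand is reduced to just its literal value for illustration.

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical illustration only; not taken from Druid or Calcite.
   class InFilterSketch {
       // Quadratic: every operand is compared against all values kept so far,
       // so n operands cost on the order of n^2 comparisons in the worst case.
       static List<String> collapseQuadratic(List<String> orOperands) {
           List<String> values = new ArrayList<>();
           for (String operand : orOperands) {
               boolean seen = false;
               for (String kept : values) {
                   if (kept.equals(operand)) {
                       seen = true;
                       break;
                   }
               }
               if (!seen) {
                   values.add(operand);
               }
           }
           return values;
       }

       // Linear (expected): one pass, de-duplicating through a hash set that
       // preserves first-seen order.
       static List<String> collapseLinear(List<String> orOperands) {
           Set<String> values = new LinkedHashSet<>(orOperands);
           return new ArrayList<>(values);
       }
   }
   ```

   Both methods return the same IN list; a benchmark over a few thousand operands is where the difference between them would be expected to show up.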
   
   When I look at the massive UNION ALL query in query 19, there is an actual difference in the timings for that query between GenericIndexed and the FrontCodedIndex.  I just wonder if that's because of the IN clause at the very end of that query...
   
   Just looked again at the `FrontCodedIndexBenchmark` and I see the `GenericIndexed` in there too.  I can't help but wonder if the differences there are actually from the fact that `FrontCoded` is dealing with UTF8 bytes instead of `String`.  I.e. all of the overhead of dealing with `String` is equivalent to the overhead of the extra objects, and the two are just canceling each other out.  If that's the case, then there's even more benefit to be had from the `FrontCodedIndexBenchmark`.  The other thing, which is much more difficult to measure but which we have seen, is that garbage triggers GC, and GC stops *all* queries from making progress.  This can limit the overall level of concurrency of the distributed system and can be difficult to measure in a benchmark.
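   
   For reference, a minimal sketch of front coding over UTF8 bytes, where each value stores only the number of leading bytes it shares with the previous value plus the remaining suffix.  This is a simplified illustration and not the on-disk format or API from this PR; `FrontCodingSketch`, `Entry`, `encode`, and `decode` are hypothetical names.  Note that decoding works on `byte[]` throughout and never materializes a `String`.

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical illustration only; not the format used by the PR.
   class FrontCodingSketch {
       // One encoded value: bytes shared with the previous value, then the rest.
       static final class Entry {
           final int prefixLength;
           final byte[] suffix;

           Entry(int prefixLength, byte[] suffix) {
               this.prefixLength = prefixLength;
               this.suffix = suffix;
           }
       }

       // Encodes values that are assumed to be sorted, so neighbors share prefixes.
       static List<Entry> encode(List<String> sortedValues) {
           List<Entry> out = new ArrayList<>();
           byte[] previous = new byte[0];
           for (String value : sortedValues) {
               byte[] current = value.getBytes(StandardCharsets.UTF_8);
               int limit = Math.min(previous.length, current.length);
               int shared = 0;
               while (shared < limit && previous[shared] == current[shared]) {
                   shared++;
               }
               out.add(new Entry(shared, Arrays.copyOfRange(current, shared, current.length)));
               previous = current;
           }
           return out;
       }

       // Rebuilds the UTF8 bytes of every value without creating String objects.
       static List<byte[]> decode(List<Entry> entries) {
           List<byte[]> out = new ArrayList<>();
           byte[] previous = new byte[0];
           for (Entry entry : entries) {
               byte[] current = new byte[entry.prefixLength + entry.suffix.length];
               System.arraycopy(previous, 0, current, 0, entry.prefixLength);
               System.arraycopy(entry.suffix, 0, current, entry.prefixLength, entry.suffix.length);
               out.add(current);
               previous = current;
           }
           return out;
       }
   }
   ```

   A comparison that stays on `byte[]` the whole way through, as `decode` does here, is the kind of path where the `String` overhead mentioned above would not be paid.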


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: commits-help@druid.apache.org