Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/02/20 07:36:26 UTC

[GitHub] clintropolis commented on issue #3878: Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values

clintropolis commented on issue #3878: Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values
URL: https://github.com/apache/incubator-druid/issues/3878#issuecomment-465459575
 
 
   I've also been lightly experimenting with this, as an offshoot of #6633, to try to get enough of a feel for things to put together a proposal for a more generic solution than in that PR. But I've been rather sidetracked by some other things and haven't gotten back to it yet, so I can at least share my thoughts so far in case they are useful to you, or in case someone else wants to run with it before I can pick it back up.
   
   Besides selectivity, my limited experiments lead me to think that the type of filter can also drive whether a filter should be applied as a pre-filter or a post-filter, since not all filter value matching is equal. In particular, for things like bloom filters, which need to compute a hash per value and test each hash against the bloom filter, the matching itself is very expensive for high cardinality dimensions before it even gets to combining the bitmaps. #6633 illustrates that in that case it can be useful to always push the filter to a post-filter when the cardinality is high, even if the filter is highly selective.
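
   To make the trade-off concrete, here is a very rough sketch of the kind of cost comparison I have in mind; this is not actual Druid code, and the names and the cost model are just placeholders:

   ```java
   // Not Druid code, just a rough sketch of the trade-off: a pre-filter has to run the
   // (possibly expensive) value match once per dictionary entry and union the resulting
   // bitmaps, while a post-filter runs the match once per row during the cursor scan.
   // All names and cost numbers here are assumptions for illustration only.
   public class PreVsPostFilterCost
   {
     /**
      * @param dimCardinality    distinct dimension values the filter would have to test
      * @param perValueMatchCost relative cost of matching one value (a bloom filter hash
      *                          and probe is much pricier than a simple equality check)
      * @param numRows           rows in the segment
      * @param perRowScanCost    relative cost of evaluating the matcher against one row
      */
     static boolean preferPreFilter(
         long dimCardinality,
         double perValueMatchCost,
         long numRows,
         double perRowScanCost
     )
     {
       double preFilterCost = dimCardinality * perValueMatchCost;
       double postFilterCost = numRows * perRowScanCost;
       return preFilterCost < postFilterCost;
     }

     public static void main(String[] args)
     {
       // Cheap selector-like match over a modest dictionary: pre-filter wins.
       System.out.println(preferPreFilter(10_000, 1.0, 5_000_000, 1.0));
       // Bloom-like match over a very high cardinality dimension: post-filter wins,
       // even though the filter itself might be highly selective.
       System.out.println(preferPreFilter(4_000_000, 10.0, 5_000_000, 2.0));
     }
   }
   ```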
   
   The hard part then would be figuring out how to encode this so that the threshold can be a function of how expensive the filter is, assuming such a classification is possible or useful in practice, and that other filters exhibit patterns similar to what I was observing with bloom filters, which have been my main focus so far. As a side effect, introducing the idea that filters have some sort of cost value also opens up another optimization for evaluating 'and' filters, by giving a sensible mechanism to control the order in which filters are evaluated, so that cheaper filters are evaluated first and non-matches potentially shake out faster.
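
   Something like the following is roughly what I mean by using a cost value to order 'and' filter evaluation. It is only an illustrative sketch with made-up interfaces, not a proposal for actual Druid APIs:

   ```java
   import java.util.Arrays;
   import java.util.Comparator;
   import java.util.List;
   import java.util.function.LongPredicate;

   public class OrderedAndMatcher
   {
     // A made-up interface pairing a row matcher with an estimated per-row cost; Druid's
     // real filter/matcher interfaces don't carry a cost like this today.
     interface CostedMatcher extends LongPredicate
     {
       double estimatedCostPerRow();
     }

     static LongPredicate and(List<CostedMatcher> matchers)
     {
       // Sort so the cheapest matchers run first; expensive ones only see rows
       // that survived everything cheaper.
       CostedMatcher[] ordered = matchers.stream()
           .sorted(Comparator.comparingDouble(CostedMatcher::estimatedCostPerRow))
           .toArray(CostedMatcher[]::new);
       return row -> {
         for (CostedMatcher m : ordered) {
           if (!m.test(row)) {
             return false;
           }
         }
         return true;
       };
     }

     // Helper to build a CostedMatcher from a plain predicate and a cost estimate.
     static CostedMatcher costed(double cost, LongPredicate predicate)
     {
       return new CostedMatcher()
       {
         @Override
         public boolean test(long row)
         {
           return predicate.test(row);
         }

         @Override
         public double estimatedCostPerRow()
         {
           return cost;
         }
       };
     }

     public static void main(String[] args)
     {
       LongPredicate combined = and(Arrays.asList(
           costed(10.0, row -> Long.toString(row).hashCode() % 7 == 0), // "expensive", runs second
           costed(1.0, row -> row % 2 == 0)                             // cheap, runs first
       ));
       System.out.println(combined.test(42));
     }
   }
   ```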
   
   It looks like some effort has gone into selectivity estimation for use with `search` queries in the past, but my gut tells me many of these estimates are fairly expensive to compute, so I'm unsure of their utility for all query types. I haven't verified this however, nor am I super familiar with these estimators, so it's possible I'm wrong. I was hoping experimentation would shake out whether knowing filter selectivity directly is even necessary. It might be enough to know that certain types of filters (selectors and combinations of selectors) should rarely, if ever, be anything other than pre-filters, while others should generally skip the bitmap indexes whenever the cardinality is high enough, perhaps with the threshold adjusted based on how expensive the filter is. However, I suspect it's not actually that simple either; hard to say without further experiments.
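
   So the rule might be as dumb as a per-filter-type cardinality threshold, something like this (the filter kinds, costs, and base threshold are entirely made-up numbers, just to show the shape of the idea):

   ```java
   // Hypothetical sketch of a per-type rule rather than true selectivity estimation:
   // cheap filter types (selector-like) always take the bitmap index path, while
   // expensive ones skip bitmaps once dimension cardinality crosses a threshold that
   // shrinks with the filter's per-value cost. All values here are invented.
   public class FilterTypeThreshold
   {
     enum FilterKind
     {
       SELECTOR(1.0), LIKE(3.0), BLOOM(10.0);

       final double perValueCost;

       FilterKind(double perValueCost)
       {
         this.perValueCost = perValueCost;
       }
     }

     static final long BASE_CARDINALITY_THRESHOLD = 1_000_000;

     static boolean useBitmapIndex(FilterKind kind, long dimCardinality)
     {
       // The more expensive the per-value match, the lower the cardinality at which we
       // give up on unioning bitmaps and fall back to scanning rows with a value matcher.
       long threshold = (long) (BASE_CARDINALITY_THRESHOLD / kind.perValueCost);
       return dimCardinality <= threshold;
     }

     public static void main(String[] args)
     {
       System.out.println(useBitmapIndex(FilterKind.SELECTOR, 500_000)); // true
       System.out.println(useBitmapIndex(FilterKind.BLOOM, 500_000));    // false
     }
   }
   ```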
   
   I don't really know what this looks like implementation-wise yet; there are too many unknowns at this point. All testing I have done so far has been very manual. For my next steps, I was planning on setting up a test harness that would let me manually control whether or not filters use bitmap indexes, similar to #6633 but perhaps as an option on all filters, and then benchmark with parameters covering a variety of conditions to find whether there is indeed a 'break even' point and how it varies with filter type, overall dimension cardinality, and filter selectivity. But like I said, I'm not sure when I'll get to this, so if this interests you, I say have at it, and I'll be more than happy to support with further discussion or review instead in order to see this get done :+1:
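
   For reference, the rough shape of the harness I mean is something like this toy. It is not Druid code and not a proper benchmark; it just uses per-value row id lists as a crude stand-in for bitmap indexes, and all the parameter values are arbitrary:

   ```java
   // Self-contained toy: build per-value row id lists (a crude stand-in for bitmap
   // indexes) over a random dictionary-encoded column, then time "union the matching
   // values' row sets" against "full scan with a per-row predicate" across a grid of
   // cardinalities and selectivities, to look for a break-even point.
   import java.util.BitSet;
   import java.util.Random;

   public class FilterBreakEvenHarness
   {
     public static void main(String[] args)
     {
       final int numRows = 2_000_000;
       final Random random = new Random(42);

       for (int cardinality : new int[]{1_000, 100_000, 1_000_000}) {
         // Dictionary-encoded dimension column with uniformly distributed values.
         final int[] column = new int[numRows];
         for (int i = 0; i < numRows; i++) {
           column[i] = random.nextInt(cardinality);
         }

         // Per-value row id lists, standing in for per-value bitmap indexes.
         final int[] counts = new int[cardinality];
         for (int i = 0; i < numRows; i++) {
           counts[column[i]]++;
         }
         final int[][] index = new int[cardinality][];
         for (int v = 0; v < cardinality; v++) {
           index[v] = new int[counts[v]];
         }
         final int[] filled = new int[cardinality];
         for (int i = 0; i < numRows; i++) {
           final int v = column[i];
           index[v][filled[v]++] = i;
         }

         for (double selectivity : new double[]{0.001, 0.1, 0.5}) {
           // Treat the filter as matching the first N dictionary values; with uniform
           // data the fraction of matching rows roughly equals the fraction of values.
           final int matchingValues = Math.max(1, (int) (cardinality * selectivity));

           long start = System.nanoTime();
           final BitSet union = new BitSet(numRows);
           for (int v = 0; v < matchingValues; v++) {
             for (int row : index[v]) {
               union.set(row);        // "pre-filter": union the per-value row sets
             }
           }
           final long unionNanos = System.nanoTime() - start;

           start = System.nanoTime();
           final BitSet scanned = new BitSet(numRows);
           for (int i = 0; i < numRows; i++) {
             if (column[i] < matchingValues) {  // "post-filter": test every row
               scanned.set(i);
             }
           }
           final long scanNanos = System.nanoTime() - start;

           System.out.printf(
               "cardinality=%d selectivity=%.3f matchedRows=%d union=%.1fms scan=%.1fms%n",
               cardinality, selectivity, union.cardinality(),
               unionNanos / 1_000_000.0, scanNanos / 1_000_000.0
           );
         }
       }
     }
   }
   ```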
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org