Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/11/05 08:34:14 UTC

[GitHub] [incubator-druid] clintropolis commented on issue #8822: optimize numeric column null value checking for low filter selectivity (more rows)

URL: https://github.com/apache/incubator-druid/pull/8822#issuecomment-549717755
 
 
   >The heatmaps look super cool! (although I don't think I fully understand them yet :| ) What did you use to build them?
   
   Hah, thanks, I used R with ggplot2 to make them. I'll try to clean up the code and attach it, if I have a chance, in case anyone else wants to do some benchmarking or tinker with the results. As for what they mean, I'll try my best to explain it as succinctly as possible 😅.
   
   The benchmark I added in this PR, `NullHandlingBitmapGetVsIteratorBenchmark`, simulates approximately what happens during query processing on a historical for numeric columns that contain nulls when used with something like a `NullableAggregator`, which is a wrapper around another `Aggregator` that either ignores `null` values or delegates aggregation to the wrapped aggregator for rows that have actual values.
   
   When SQL compatible null handling is enabled, numeric columns are stored with 2 parts if nulls are present: the column itself, and a bitmap that has a set bit for each null value. At query time, filters are evaluated to compute something called an `Offset`, which is basically just the set of rows that are taking part in the query, and which is used to create a column value/vector selector for those rows from the underlying column. Selectors have an `isNull` method which can be used to determine if a particular row is `null`, and for numeric columns this checks whether that row's bit is set in the bitmap. So mechanically, `NullableAggregator` will check each row from the selector to see if it is null (through the underlying bitmap); if it is, the row is ignored, and if not, it delegates to the underlying `Aggregator` to do whatever it does to compute the result.
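
To make the mechanics above a bit more concrete, here is a minimal sketch of the "check `isNull`, then skip or delegate" pattern. The `RowSelector`, `SimpleAggregator`, `LongSum`, and `NullSkipping` types are hypothetical stand-ins invented for illustration and are not Druid's actual interfaces; a `java.util.BitSet` plays the role of the null bitmap.

```java
import java.util.BitSet;

final class NullableAggregatorSketch {
    /** Hypothetical stand-in for a selector positioned on one row at a time. */
    interface RowSelector {
        boolean isNull();
        long getLong();
    }

    /** Hypothetical stand-in for an aggregator (not Druid's Aggregator API). */
    interface SimpleAggregator {
        void aggregate(long value);
        long get();
    }

    static final class LongSum implements SimpleAggregator {
        private long sum;
        public void aggregate(long value) { sum += value; }
        public long get() { return sum; }
    }

    /** Wrapper that ignores null rows and delegates everything else. */
    static final class NullSkipping {
        private final RowSelector selector;
        private final SimpleAggregator delegate;

        NullSkipping(RowSelector selector, SimpleAggregator delegate) {
            this.selector = selector;
            this.delegate = delegate;
        }

        void aggregate() {
            if (!selector.isNull()) {
                delegate.aggregate(selector.getLong());
            }
        }
    }

    /** Sums every row of values whose bit is NOT set in the null bitmap. */
    static long sumNonNull(long[] values, BitSet nulls) {
        final int[] row = {0}; // current row position shared with the selector
        RowSelector selector = new RowSelector() {
            public boolean isNull() { return nulls.get(row[0]); }
            public long getLong() { return values[row[0]]; }
        };
        LongSum sum = new LongSum();
        NullSkipping agg = new NullSkipping(selector, sum);
        for (row[0] = 0; row[0] < values.length; row[0]++) {
            agg.aggregate();
        }
        return sum.get();
    }

    public static void main(String[] args) {
        BitSet nulls = new BitSet();
        nulls.set(1);
        nulls.set(3);
        // Rows 1 and 3 are null, so only 1 and 3 (rows 0 and 2) are summed.
        System.out.println(sumNonNull(new long[] {1, 2, 3, 4}, nulls)); // prints 4
    }
}
```

The point of the sketch is only that every aggregated row costs one null check against the bitmap, which is the per-row work the benchmark is trying to measure.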
   
   The benchmark simplifies this concept into using a `BitSet` to simulate the `Offset`, an `ImmutableBitmap` for the null value bitmap, and a for loop that iterates over the "rows" selected by the `BitSet` to emulate the behavior of the aggregator on the selector, checking for set bits in the `ImmutableBitmap` for each index like `isNull` would be doing.
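
A rough sketch of the two strategies being compared might look like the following. This is not the benchmark code itself: for self-containment it assumes a plain `java.util.BitSet` in place of the `ImmutableBitmap`, and emulates the bitmap iterator with a `nextSetBit` cursor. The class and method names are made up for illustration.

```java
import java.util.BitSet;

final class NullCheckSketch {
    /** Strategy 1: random-access lookup in the null bitmap for every selected row. */
    static long sumWithGet(long[] values, BitSet selected, BitSet nulls) {
        long sum = 0;
        for (int row = selected.nextSetBit(0); row >= 0; row = selected.nextSetBit(row + 1)) {
            if (!nulls.get(row)) {
                sum += values[row];
            }
        }
        return sum;
    }

    /** Strategy 2: advance a cursor over the null bitmap alongside the selected rows. */
    static long sumWithIterator(long[] values, BitSet selected, BitSet nulls) {
        long sum = 0;
        int nextNull = nulls.nextSetBit(0); // "peeked" next null row, or -1 if none left
        for (int row = selected.nextSetBit(0); row >= 0; row = selected.nextSetBit(row + 1)) {
            // Step the null cursor forward one set bit at a time until it
            // reaches or passes the current row.
            while (nextNull >= 0 && nextNull < row) {
                nextNull = nulls.nextSetBit(nextNull + 1);
            }
            if (nextNull != row) {
                sum += values[row];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {10, 20, 30, 40, 50};
        BitSet selected = new BitSet();
        selected.set(0);
        selected.set(2, 5); // rows 2, 3, 4
        BitSet nulls = new BitSet();
        nulls.set(2);
        nulls.set(4);
        // Selected non-null rows are 0 and 3, so both strategies give 10 + 40.
        System.out.println(sumWithGet(values, selected, nulls));      // prints 50
        System.out.println(sumWithIterator(values, selected, nulls)); // prints 50
    }
}
```

Both methods compute the same result; what differs is how the null bitmap is consulted, which is what the benchmark times across different null densities and selectivities.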
   
   Translating this into the heatmaps: the y axis shows the effect of differences in the density of the null bitmap (bottom is a few null values, top is nearly all rows null), the x axis is the difference in the number of rows that our selector will select (left side selects very few rows, right scans nearly all rows), and the z axis is the difference in benchmark operation time between using `bitmap.get` and using an iterator (or peekable iterator) from the null bitmap to move along with the iterator on the selectivity bitset. Further, some of the heatmaps translate the raw benchmark times into the _time per row_ by scaling the time by how many rows are selected, to standardize measurement across the x axis, making it easier to compare the 2 strategies.
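
As a small illustration of the _time per row_ scaling, assuming it means dividing the raw operation time by the number of selected rows, the arithmetic would be roughly as below. The numbers in `main` are made up purely to show the calculation and are not benchmark results; the class and method names are hypothetical.

```java
final class TimePerRowSketch {
    /** Raw operation time divided by the number of rows the selector visited. */
    static double timePerRow(double opTimeNanos, int selectedRows) {
        return opTimeNanos / selectedRows;
    }

    /** Positive when the iterator strategy is cheaper per row than bitmap.get. */
    static double perRowDifference(double getNanos, double iteratorNanos, int selectedRows) {
        return timePerRow(getNanos, selectedRows) - timePerRow(iteratorNanos, selectedRows);
    }

    public static void main(String[] args) {
        // Made-up inputs: 1000 ns vs 500 ns for an operation over 250 selected rows.
        System.out.println(timePerRow(1000.0, 250));              // prints 4.0
        System.out.println(perRowDifference(1000.0, 500.0, 250)); // prints 2.0
    }
}
```

Normalizing this way keeps a cell on the right of the heatmap (many rows) comparable with one on the left (few rows), since the raw times would otherwise mostly reflect how many rows were visited.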
   
   Sorry, that didn't end up being so short... I hope this didn't make it more confusing 😜
