Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/01/02 09:34:51 UTC

[GitHub] gianm opened a new pull request #6794: Query vectorization.

URL: https://github.com/apache/incubator-druid/pull/6794
 
 
   See https://static.imply.io/gianm/vb.html for benchmarks and analysis.
   
   By "vectorization" I mean enabling processors (like filters, aggregators, and query engines) to operate on batches of rows at once, rather than one row at a time. This speeds up queries by reducing the number of method calls, improving cache efficiency, and potentially enabling CPU SIMD instructions. (In theory, at least, it makes SIMD possible, but I haven't looked into whether such instructions are actually being generated with this patch.)
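
To make the row-at-a-time versus batch distinction concrete, here is a minimal sketch. All names in it (`VectorizationSketch`, `RowAggregator`, `BatchAggregator`) are hypothetical and are not the interfaces added by this patch; it only illustrates how batching replaces one interface call per row with one call per batch, leaving a tight loop over a primitive array inside.

```java
/** Illustrative only: these types are assumptions, not classes from this patch. */
final class VectorizationSketch {
    /** Row-at-a-time style: one interface call per row. */
    interface RowAggregator {
        void aggregate(long value);
        long get();
    }

    /** Batch style: one interface call per batch of rows. */
    interface BatchAggregator {
        void aggregate(long[] values, int startRow, int endRow);
        long get();
    }

    static final class RowSum implements RowAggregator {
        private long sum;
        public void aggregate(long value) { sum += value; }
        public long get() { return sum; }
    }

    static final class BatchSum implements BatchAggregator {
        private long sum;
        public void aggregate(long[] values, int startRow, int endRow) {
            // Tight loop over a primitive array: few calls, sequential memory access.
            long local = 0;
            for (int i = startRow; i < endRow; i++) {
                local += values[i];
            }
            sum += local;
        }
        public long get() { return sum; }
    }

    static long sumRowAtATime(long[] column) {
        RowAggregator agg = new RowSum();
        for (long v : column) {
            agg.aggregate(v); // column.length calls
        }
        return agg.get();
    }

    static long sumBatched(long[] column, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        BatchAggregator agg = new BatchSum();
        for (int start = 0; start < column.length; start += batchSize) {
            int end = Math.min(start + batchSize, column.length);
            agg.aggregate(column, start, end); // roughly column.length / batchSize calls
        }
        return agg.get();
    }

    public static void main(String[] args) {
        long[] column = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumRowAtATime(column));
        System.out.println(sumBatched(column, 3));
    }
}
```

Both methods compute the same sum; only the call pattern differs. Whether the JIT turns the inner loop into SIMD instructions is not something this sketch demonstrates.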
   
   This would represent a major change in how Druid query engines work, but one that would be beneficial and has been a long time coming (see #3011 for some earlier discussion). Most of the new stuff is in new classes, so the old code isn't touched much. For our collective sanity, we may want to consider working on projects that allow us to remove the non-vectorized code paths. This would include vectorizing other query engines (notably topN), adding vectorization support for virtual columns and descending order (neither of which is supported in this patch), and adding vectorized implementations for all aggregators, filters, and dimension specs that don't currently have them (see below for a list). And last but not least: adapting all the tests to exercise the vectorized code paths -- this patch only does a handful.
   
   This patch includes vectorized **timeseries** and **groupBy** engines, as well as some **analogs of your favorite Druid classes**:
   
   - VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
   - VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has methods to create analogs of the column selectors you know and love.
   - VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
   - VectorAggregator is like BufferAggregator.
   - VectorValueMatcher is like ValueMatcher.
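
As a rough illustration of what "the batch analog of a matcher" can look like, here is a sketch. The interfaces and method signatures are invented for illustration and should not be read as the actual ValueMatcher or VectorValueMatcher APIs; the assumption is only that a batch matcher receives a whole batch of dictionary ids and reports which row positions matched.

```java
import java.util.Arrays;

/** Illustrative only: not the real ValueMatcher / VectorValueMatcher interfaces. */
final class BatchMatcherSketch {
    /** Row-at-a-time matcher: answers for a single row's dictionary id. */
    interface RowMatcher {
        boolean matches(int dictionaryId);
    }

    /** Batch matcher: writes matching row positions into selectionOut, returns how many matched. */
    interface BatchMatcher {
        int match(int[] dictionaryIds, int size, int[] selectionOut);
    }

    /** Builds a batch matcher that selects rows whose id equals wantedId. */
    static BatchMatcher equalTo(final int wantedId) {
        return (ids, size, out) -> {
            int n = 0;
            for (int row = 0; row < size; row++) {
                if (ids[row] == wantedId) {
                    out[n++] = row;
                }
            }
            return n;
        };
    }

    /** Convenience wrapper returning exactly the matched row positions. */
    static int[] selectedRows(int[] dictionaryIds, int wantedId) {
        int[] out = new int[dictionaryIds.length];
        int n = equalTo(wantedId).match(dictionaryIds, dictionaryIds.length, out);
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] ids = {3, 1, 3, 2, 3};
        System.out.println(Arrays.toString(selectedRows(ids, 3)));
    }
}
```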
   
   There are some **notable differences** between vectorized and regular execution:
   
   - Regular cursors have a single DimensionSelector class, but vector cursors have both SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector. This is done because it allows the single-valued selector to use a primitive `int[]` for a batch of rows.
   - Unlike regular cursors, vector cursors do not understand time granularity. They expect query engines to handle this on their own, which a new VectorCursorGranularizer class helps with. This is to avoid too much batch-splitting and to respect the fact that vector selectors are somewhat more heavyweight than regular selectors.
   - Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes for filters that might partially support them (like an OR of one filter that supports indexing and another that doesn't). I'm not sure that this behavior is desirable anyway (it is potentially too eager) but, at any rate, it'd be better to harmonize it between the two classes. Potentially they should both do some different thing that is smarter than what either of them is doing right now.
   - When vector cursors are created by QueryableIndexCursorSequenceBuilder, they use a morphing binary-then-linear search to find their start and end rows, rather than linear search.
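
The last point can be sketched as follows. This is one plausible reading of "binary-then-linear" search, not the code in QueryableIndexCursorSequenceBuilder: binary search narrows the candidate range until it is small, then a linear scan finishes. The class name, method name, and the cutoff of 16 rows are all assumptions.

```java
/** Illustrative only: one possible shape of a binary-then-linear search over sorted timestamps. */
final class BinaryThenLinearSearchSketch {
    /** Assumed cutoff below which scanning linearly is preferred. */
    static final int LINEAR_THRESHOLD = 16;

    /**
     * Returns the index of the first row whose timestamp is >= target,
     * or sortedTimestamps.length if there is no such row.
     */
    static int firstRowAtOrAfter(long[] sortedTimestamps, long target) {
        int lo = 0;
        int hi = sortedTimestamps.length;
        // Invariant: rows before lo are < target; rows at or after hi are >= target.
        while (hi - lo > LINEAR_THRESHOLD) {
            int mid = (lo + hi) >>> 1;
            if (sortedTimestamps[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Small remaining range: finish with a linear scan.
        for (int i = lo; i < hi; i++) {
            if (sortedTimestamps[i] >= target) {
                return i;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        long[] timestamps = new long[100];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = i * 10L;
        }
        System.out.println(firstRowAtOrAfter(timestamps, 255L));
    }
}
```

The same routine can locate the end row by searching for the interval's end timestamp.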
   
   **Limitations** in this patch are:
   
   - Only timeseries and groupBy have vectorized engines.
   - GroupBy doesn't handle multi-value dimensions yet.
   - Vector cursors cannot handle virtual columns or descending order.
   - Expressions are not supported anywhere: not as inputs to aggregators, in virtual columns, or in filters.
   - Only some aggregators have vectorized implementations: "count", "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
   - Only some filters have vectorized matchers: "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
   - Dimension specs other than "default" don't work yet (no extraction functions or filtered dimension specs).
   
   I believe this list of limitations is exhaustive, meaning the feature will be "complete" when everything on that list is taken care of.
   
   Currently, the testing strategy includes adding vectorization-enabled tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest, GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the filtering tests that extend BaseFilterTest. In all of those classes, there are some test cases that don't support vectorization. They are marked by special function calls like "cannotVectorize" or "skipVectorize" that tell the test harness to either expect an exception or to skip the test case.
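
The marker mechanism can be pictured with a small sketch. Only the names "cannotVectorize" and "skipVectorize" come from the description above; the harness class, its `check` method, and the choice to treat any RuntimeException as the expected failure are assumptions made for illustration, not the test code in this patch.

```java
import java.util.function.Supplier;

/** Illustrative only: a hypothetical harness, not the one used by the Druid tests. */
final class VectorizeHarnessSketch {
    private boolean cannotVectorize;
    private boolean skipVectorize;

    /** Marks the current test case as one where vectorized execution should fail. */
    void cannotVectorize() { cannotVectorize = true; }

    /** Marks the current test case as one where vectorized execution is not attempted. */
    void skipVectorize() { skipVectorize = true; }

    /** Runs both code paths and returns a short description of what was checked. */
    <T> String check(T expected, Supplier<T> nonVectorized, Supplier<T> vectorized) {
        if (!expected.equals(nonVectorized.get())) {
            throw new AssertionError("non-vectorized result mismatch");
        }
        if (skipVectorize) {
            return "skipped";
        }
        if (cannotVectorize) {
            try {
                vectorized.get();
            } catch (RuntimeException e) {
                // Assumption: an unsupported query surfaces as a runtime exception.
                return "failed as expected";
            }
            throw new AssertionError("expected vectorized execution to fail");
        }
        if (!expected.equals(vectorized.get())) {
            throw new AssertionError("vectorized result mismatch");
        }
        return "verified";
    }

    public static void main(String[] args) {
        VectorizeHarnessSketch harness = new VectorizeHarnessSketch();
        harness.cannotVectorize();
        System.out.println(harness.check(42, () -> 42, () -> {
            throw new UnsupportedOperationException("cannot vectorize");
        }));
    }
}
```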
   
   Testing should be expanded in the future -- a project in and of itself.
