You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/09/24 18:52:30 UTC

[GitHub] [druid] gianm commented on a change in pull request #10432: add vectorizeVirtualColumns query context parameter

gianm commented on a change in pull request #10432:
URL: https://github.com/apache/druid/pull/10432#discussion_r494536534



##########
File path: docs/querying/query-context.md
##########
@@ -92,7 +92,7 @@ include "selector", "bound", "in", "like", "regex", "search", "and", "or", and "
 - All aggregators must offer vectorized implementations. These include "count", "doubleSum", "floatSum", "longSum", "longMin",
  "longMax", "doubleMin", "doubleMax", "floatMin", "floatMax", "longAny", "doubleAny", "floatAny", "stringAny",
  "hyperUnique", "filtered", "approxHistogram", "approxHistogramFold", and "fixedBucketsHistogram" (with numerical input). 
-- No virtual columns.
+- All virtual columns must offer vectorized implementations.

Review comment:
       There's only one built-in virtual column: "expression". So, we should mention when that specific type of virtual column can be vectorized. (Does it depend on the type of expression? Types of input? Functions used? etc)

##########
File path: docs/querying/query-context.md
##########
@@ -106,3 +106,4 @@ vectorization. These query types will ignore the "vectorize" parameter even if i
 |--------|-------|------------|
 |vectorize|`true`|Enables or disables vectorized query execution. Possible values are `false` (disabled), `true` (enabled if possible, disabled otherwise, on a per-segment basis), and `force` (enabled, and groupBy or timeseries queries that cannot be vectorized will fail). The `"force"` setting is meant to aid in testing, and is not generally useful in production (since real-time segments can never be processed with vectorized execution, any queries on real-time data will fail). This will override `druid.query.default.context.vectorize` if it's set.|
 |vectorSize|`512`|Sets the row batching size for a particular query. This will override `druid.query.default.context.vectorSize` if it's set.|
+|vectorizeVirtualColumns|`true`|Enables or disables vectorized query processing of queries with virtual columns, layered on top of `vectorize` (`vectorize` must also be set to true for a query to utilize vectorization). Possible values are `false` (disabled), `true` (enabled if possible, disabled otherwise, on a per-segment basis), and `force` (enabled, and groupBy or timeseries queries with virtual columns that cannot be vectorized will fail). The `"force"` setting is meant to aid in testing, and is not generally useful in production. This will override `druid.query.default.context.vectorizeVirtualColumns` if it's set.|

Review comment:
       IMO, it'd be best to set it to `false` now, then change it to `true` for the next release. It'll lower the risk around upgrading, and we want upgrades to be really low risk so people feel comfortable just dropping in the newest version.
   
   The specific risks here are:
   
   1) Bugs in vectorized expressions.
   2) Situations where vectorized expressions perform worse than non-vectorized ones (perhaps some optimizations are implemented for the non-vectorized case, but not the vectorized case).
   3) Situations where there is a bug in the vectorized query engines _in general_, but a specific user isn't hitting it, because they're using expressions and so the query wouldn't vectorize anyway.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/VectorGroupByEngine.java
##########
@@ -76,10 +76,18 @@ public static boolean canVectorize(
     Function<String, ColumnCapabilities> capabilitiesFunction = name ->
         query.getVirtualColumns().getColumnCapabilitiesWithFallback(adapter, name);
 
+    final boolean adapterCanVectorize = adapter.canVectorize(filter, query.getVirtualColumns(), false);
+    final boolean virtualColumnsCanVectorize;
+    if (query.getVirtualColumns().getVirtualColumns().length > 0) {

Review comment:
       There's some duplication here between the query engines — how about creating a helper that takes the query and the virtual columns and creates a `Vectorize` instance that synthesizes both?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org