You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/10 12:36:52 UTC

[GitHub] [druid] cryptoe opened a new pull request #12253: Adding new config for disabling group by on multiValue column

cryptoe opened a new pull request #12253:
URL: https://github.com/apache/druid/pull/12253


   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   <!-- Please read the doc for contribution (https://github.com/apache/druid/blob/master/CONTRIBUTING.md) before making this PR. Also, once you open a PR, please _avoid using force pushes and rebasing_ since these make it difficult for reviewers to see what you've changed in response to their reviews. See [the 'If your pull request shows conflicts with master' section](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master) for more details. -->
   
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   As part of https://github.com/apache/druid/pull/12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key. 
   Added a config `groupByEnableMultiValueUnnesting` which can be set in the query context. 
   
   The default value of `groupByEnableMultiValueUnnesting` is `true`, therefore it does not change the current engine behavior. 
   If `groupByEnableMultiValueUnnesting` is set to false, the query will fail if it encounters a multi-value column in the grouping key. 
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is a corresponding issue (referenced above), it's not necessary to repeat the description here, however, you may choose to keep one summary sentence. -->
   
   <!-- Describe your patch: what did you change in code? How did you fix the problem? -->
   
   <!-- If there are several relatively logically separate changes in this PR, create a mini-section for each of them. For example: -->
   
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `GroupByQueryConfig`
    * `GroupByQueryEngineV2`
    * `GroupByQueryEngine`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
cryptoe commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r806136177



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -375,3 +375,8 @@ This query returns the following result:
 Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
 `dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
 Having specs are applied at the outermost level of groupBy query processing.
+
+## Disable GroupBy on multivalue columns
+
+As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting
+`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out.

Review comment:
       In the latest draft I have kept the documentation and changed the error messaging a bit. 
   Update the "GroupBy v2 configurations" as well. 

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -327,35 +345,46 @@ public static int getCardinalityForArrayAggregation(
   }
 
   /**
-   * Checks whether all "dimensions" are either single-valued, or if the input column or output dimension spec has
-   * specified a type that {@link ColumnType#isArray()}. Both cases indicate we don't want to explode the under-lying
-   * multi value column. Since selectors on non-existent columns will show up as full of nulls, they are effectively
-   * single valued, however capabilites on columns can also be null, for example during broker merge with an 'inline'
-   * datasource subquery, so we only return true from this method when capabilities are fully known.
+   * Returns all dimension names that are or could be multi valued, or if the input column specified a type that
+   * {@link ColumnType#isArray()}. Both cases indicate we want to explode the under-lying multi value column. Since
+   * selectors on non-existent columns will show up as full of nulls, they are effectively single valued, however
+   * capabilites on columns can also be null, for example during broker merge with an 'inline' datasource subquery. We
+   * mark columns with null capabilites as candidates for explosion.
    */
-  public static boolean hasNoExplodingDimensions(
+  public static List<String> findAllProbableExplodingDimensions(

Review comment:
       Thanks for the catch. Updated the check while doing row by row iteration. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
cryptoe commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r804487958



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryEngine.java
##########
@@ -91,7 +91,13 @@ public GroupByQueryEngine(
           "Null storage adapter found. Probably trying to issue a query against a segment being memory unmapped."
       );
     }
-
+    if (!query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true)) {
+      throw new UnsupportedOperationException(String.format(
+          "GroupBy v1 does not support %s as false. Set %s to true",

Review comment:
       Done

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -228,6 +228,17 @@ public GroupByEngineIterator make()
                     processingBuffer
                 );
 
+                final boolean allSingleValueDims = hasNoExplodingDimensions(columnSelectorFactory,
+                                                                            query.getDimensions());
+                if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true))
+                    && !allSingleValueDims) {
+                  throw new ISE(

Review comment:
       Acked. Agreed such messaging would be better for the users




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
cryptoe commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r806136177



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -375,3 +375,8 @@ This query returns the following result:
 Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
 `dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
 Having specs are applied at the outermost level of groupBy query processing.
+
+## Disable GroupBy on multivalue columns
+
+As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting
+`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out.

Review comment:
       In the latest draft I have kept the documentation and changed the error messaging a bit. 
   Update the "GroupBy v2 configurations" as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r804064592



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryConfig.java
##########
@@ -244,6 +253,10 @@ public GroupByQueryConfig withOverrides(final GroupByQuery query)
         getNumParallelCombineThreads()
     );
     newConfig.vectorize = query.getContextBoolean(QueryContexts.VECTORIZE_KEY, isVectorize());
+    newConfig.enableMultiValueUnnesting = query.getContextValue(

Review comment:
       Should use getContextBoolean for booleans.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryEngine.java
##########
@@ -91,7 +91,13 @@ public GroupByQueryEngine(
           "Null storage adapter found. Probably trying to issue a query against a segment being memory unmapped."
       );
     }
-
+    if (!query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true)) {
+      throw new UnsupportedOperationException(String.format(
+          "GroupBy v1 does not support %s as false. Set %s to true",

Review comment:
       How about:
   
   > Set %s to true, or use groupBy v2.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -228,6 +228,17 @@ public GroupByEngineIterator make()
                     processingBuffer
                 );
 
+                final boolean allSingleValueDims = hasNoExplodingDimensions(columnSelectorFactory,
+                                                                            query.getDimensions());
+                if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true))
+                    && !allSingleValueDims) {
+                  throw new ISE(

Review comment:
       We should name the dimension so this error message is more actionable. People are going to need to know which dimension is causing the problem, so they can either remove it, set it to array mode, or process it in some other way.
   
   How about:
   
   1. Change hasNoExplodingDimensions to findExplodingDimensions (i.e., return the names of the exploding dimensions).
   2. Change the error message to:
   
   > Encountered multi-value dimensions [%s] that cannot be processed with %s set to false. Consider changing these dimensions to arrays or setting %s to true.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 merged pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 merged pull request #12253:
URL: https://github.com/apache/druid/pull/12253


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
cryptoe commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r806136905



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -327,35 +345,46 @@ public static int getCardinalityForArrayAggregation(
   }
 
   /**
-   * Checks whether all "dimensions" are either single-valued, or if the input column or output dimension spec has
-   * specified a type that {@link ColumnType#isArray()}. Both cases indicate we don't want to explode the under-lying
-   * multi value column. Since selectors on non-existent columns will show up as full of nulls, they are effectively
-   * single valued, however capabilites on columns can also be null, for example during broker merge with an 'inline'
-   * datasource subquery, so we only return true from this method when capabilities are fully known.
+   * Returns all dimension names that are or could be multi valued, or if the input column specified a type that
+   * {@link ColumnType#isArray()}. Both cases indicate we want to explode the under-lying multi value column. Since
+   * selectors on non-existent columns will show up as full of nulls, they are effectively single valued, however
+   * capabilites on columns can also be null, for example during broker merge with an 'inline' datasource subquery. We
+   * mark columns with null capabilites as candidates for explosion.
    */
-  public static boolean hasNoExplodingDimensions(
+  public static List<String> findAllProbableExplodingDimensions(

Review comment:
       Thanks for the catch. Updated the check while doing row by row iteration. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r807563062



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -889,12 +911,17 @@ private void aggregateMultiValueDims(IntGrouper grouper)
       }
 
       while (!cursor.isDone()) {
-        int multiValuesSize = multiValues.size();
+        final int multiValuesSize = multiValues.size();
         if (multiValuesSize == 0) {
           if (!grouper.aggregate(GroupByColumnSelectorStrategy.GROUP_BY_MISSING_VALUE).isOk()) {
             return;
           }
         } else {
+          if (multiValuesSize > 1) {
+            // this check is done during the row aggregation as a dimension can become multi-value col if
+            // {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown.

Review comment:
       nit: same nit and typo

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -764,6 +783,9 @@ protected void aggregateMultiValueDims(Grouper<ByteBuffer> grouper)
             );
 
             if (doAggregate) {
+              // this check is done during the row aggregation as a dimension can become multi-value col if
+              // {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown.

Review comment:
       nit: javadoc links don't work here, maybe just say "column capabilities", also "unkown" -> "unknown"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r804721274



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -375,3 +375,8 @@ This query returns the following result:
 Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
 `dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
 Having specs are applied at the outermost level of groupBy query processing.
+
+## Disable GroupBy on multivalue columns
+
+As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting
+`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out.

Review comment:
       Alternatively we can keep this documented but change the error message to not mention array-based dimensions. In that case, we can change the error message to this:
   
   > Encountered multi-value dimension [%s] that cannot be processed with %s set to false. Consider setting %s to true for unnesting behavior, or using an expression to create a scalar from the multi-value dimension.
   
   For the docs, a couple style points:
   
   1. The rest of this page uses second person ("you can…") rather than third ("users can…") so we should stick to that.
   2. We usually use US spelling in documentation (e.g. behavior instead of behaviour).
   
   So I'd go with:
   
   > You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your query context. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included. 
   
   Also, all documented groupBy parameters should be included in the groupbyquery.md document as well, under "GroupBy v2 configurations". So if you mention this here it should be mentioned in the main doc too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #12253:
URL: https://github.com/apache/druid/pull/12253#discussion_r804703490



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -375,3 +375,8 @@ This query returns the following result:
 Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
 `dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
 Having specs are applied at the outermost level of groupBy query processing.
+
+## Disable GroupBy on multivalue columns
+
+As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting
+`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out.

Review comment:
       I think we should leave this out of the docs until we the array-based story is complete. Then we can document this, array types, array functions, etc.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -327,35 +345,46 @@ public static int getCardinalityForArrayAggregation(
   }
 
   /**
-   * Checks whether all "dimensions" are either single-valued, or if the input column or output dimension spec has
-   * specified a type that {@link ColumnType#isArray()}. Both cases indicate we don't want to explode the under-lying
-   * multi value column. Since selectors on non-existent columns will show up as full of nulls, they are effectively
-   * single valued, however capabilites on columns can also be null, for example during broker merge with an 'inline'
-   * datasource subquery, so we only return true from this method when capabilities are fully known.
+   * Returns all dimension names that are or could be multi valued, or if the input column specified a type that
+   * {@link ColumnType#isArray()}. Both cases indicate we want to explode the under-lying multi value column. Since
+   * selectors on non-existent columns will show up as full of nulls, they are effectively single valued, however
+   * capabilites on columns can also be null, for example during broker merge with an 'inline' datasource subquery. We
+   * mark columns with null capabilites as candidates for explosion.
    */
-  public static boolean hasNoExplodingDimensions(
+  public static List<String> findAllProbableExplodingDimensions(

Review comment:
       Hmm, looking through this I just realized that this function can't tell if dimensions are definitely multi-valued or not. It can only tell if they _might_ be multi-valued. (But they also might not be!)
   
   So I think we should move the check; instead of checking here, we should check row by row as we actually run the query. If a multi-value row is ever encountered, then at that point we should throw the error.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #12253: Adding new config for disabling group by on multiValue column

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #12253:
URL: https://github.com/apache/druid/pull/12253#issuecomment-1041207148


   Looks good after the changes. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org