You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/19 07:44:16 UTC

[GitHub] [druid] abhishekagarwal87 opened a new pull request #11014: [Draft] Enable multiple distinct aggregators in same query

abhishekagarwal87 opened a new pull request #11014:
URL: https://github.com/apache/druid/pull/11014


   ### Description
   
   Running queries with distinct aggregations on different fields require us to enable a calcite rule AggregateExpandDistinctAggregatesRule.INSTANCE which is currently not enabled. Instead `AggregateExpandDistinctAggregatesRule.JOIN rule is used which plans queries with multiple distinct aggregations as a Join query with a join condition of type `IS_NOT_DISTINCT_FROM`  but druid only supports joins with equality conditions. AggregateExpandDistinctAggregatesRule.INSTANCE rule, on the other hand, uses subtotal specs, to run the queries with distinct aggregations. This rule also requires the grouping aggregator which has been added very recently. 
   
   With AggregateExpandDistinctAggregatesRule.INSTANCE, query planning completes just fine however, after planning, the query execution fails. This is due to a bug in how druid procures merge buffers. Druid undercounts required merge buffers when there is a nested query and subquery has subtotals.
   
   This patch fixes the logic to compute required merge buffers. Additionally, a flag has been added to control to switch between old and new behavior. 
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is a corresponding issue (referenced above), it's not necessary to repeat the description here, however, you may choose to keep one summary sentence. -->
   
   <!-- Describe your patch: what did you change in code? How did you fix the problem? -->
   
   
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #11014: Enable multiple distinct aggregators in same query

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #11014:
URL: https://github.com/apache/druid/pull/11014#discussion_r602633106



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -131,6 +131,8 @@
 import java.util.Map;
 import java.util.stream.Collectors;
 
+import static org.apache.druid.sql.calcite.planner.PlannerConfig.CTX_KEY_USE_GROUPING_SET_FOR_EXACT_DISTINCT;

Review comment:
       iirc, I think we typically prefer to not use static imports, I know this is enforced in some places, but maybe not in test code because of some junit stuffs?

##########
File path: docs/configuration/index.md
##########
@@ -1675,6 +1675,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
 |`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
+|`druid.sql.planner.useGroupingSetForExactDistinct`|Only relevant when `useApproximateCountDistinct` is disabled. If set to true, exact distinct queries are re-written using grouping sets. Otherwise, exact distinct queries are re-written using joins. This should be set to true for group by query with multiple exact distinct aggregations. This flag can be overridden per query.|false|

Review comment:
       naively this seems better maybe than using joins... is the reason to make it false by default in case there are any regressions I guess? I only ask because things that are cool, but off by default tend to take a long time to make it to being turned on, if ever.

##########
File path: processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryMergeBufferTest.java
##########
@@ -278,11 +285,12 @@ public void testDoubleNestedGroupBy()
         .setContext(ImmutableMap.of(QueryContexts.TIMEOUT_KEY, TIMEOUT))
         .build();
 
+    Assert.assertEquals(2, GroupByStrategyV2.countRequiredMergeBufferNum(query));
     GroupByQueryRunnerTestHelper.runQuery(FACTORY, runner, query);
 
     // This should be 0 because the broker needs 2 buffers and the queryable node needs one.
-    Assert.assertEquals(0, MERGE_BUFFER_POOL.getMinRemainBufferNum());
-    Assert.assertEquals(3, MERGE_BUFFER_POOL.getPoolSize());
+    Assert.assertEquals(1, MERGE_BUFFER_POOL.getMinRemainBufferNum());
+    Assert.assertEquals(4, MERGE_BUFFER_POOL.getPoolSize());

Review comment:
       nit: this comment isn't accurate anymore




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #11014: Enable multiple distinct aggregators in same query

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #11014:
URL: https://github.com/apache/druid/pull/11014


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11014: Enable multiple distinct aggregators in same query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #11014:
URL: https://github.com/apache/druid/pull/11014#discussion_r607066223



##########
File path: docs/configuration/index.md
##########
@@ -1675,6 +1675,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
 |`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
+|`druid.sql.planner.useGroupingSetForExactDistinct`|Only relevant when `useApproximateCountDistinct` is disabled. If set to true, exact distinct queries are re-written using grouping sets. Otherwise, exact distinct queries are re-written using joins. This should be set to true for group by query with multiple exact distinct aggregations. This flag can be overridden per query.|false|

Review comment:
       I didn't want to accidentally break any queries that are running already. At least for the backports, we do want to keep it off by default but maybe turn it on for new releases?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #11014: Enable multiple distinct aggregators in same query

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #11014:
URL: https://github.com/apache/druid/pull/11014#discussion_r607363683



##########
File path: docs/configuration/index.md
##########
@@ -1675,6 +1675,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
 |`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
+|`druid.sql.planner.useGroupingSetForExactDistinct`|Only relevant when `useApproximateCountDistinct` is disabled. If set to true, exact distinct queries are re-written using grouping sets. Otherwise, exact distinct queries are re-written using joins. This should be set to true for group by query with multiple exact distinct aggregations. This flag can be overridden per query.|false|

Review comment:
       I think it would be ok if we could leave off by default for next release, and maybe consider turning on in the release after




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org