You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/12/14 12:40:12 UTC

[GitHub] [druid] LakshSingla opened a new pull request #12065: Return empty result when a group by gets optimized to a timeseries query

LakshSingla opened a new pull request #12065:
URL: https://github.com/apache/druid/pull/12065


   ### Description
   Related to #11188 
   
   The above mentioned PR allowed timeseries queries to return a default result, when queries of type: `select count(*) from table where dim1="_not_present_dim_"` were executed. Before the PR, it returned no row, after the PR, it would return a row with value of `count(*)` as 0 (as expected by SQL standards of different dbs).
   
   In `Grouping#applyProject`, we can sometimes perform optimization of a groupBy query to a timeseries query if possible (when the keys of the groupBy are constants, as generated by automated tools). For example, in `select count(*) from table where dim1="_present_dim_" group by "dummy_key"`, the groupBy clause can be removed. However, in the case when the filter doesn't return anything, i.e. `select count(*) from table where dim1="_not_present_dim_" group by "dummy_key"`, the behavior of general databases would be to return nothing, while druid (due to above change) returns an empty row. This PR aims to fix this divergence of behavior.
   
   Example cases:
   1. `select count(*) from table where dim1="_not_present_dim_" group by "dummy_key"`.
   _CURRENT_: Returns a row with count(*) = 0
   _EXPECTED_: Return no row
   
   2. `select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1`
   _CURRENT_: Returns a row with ('A', 'wat')
   _EXPECTED_: Return no row
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   To do this, a boolean `droppedDimensionsWhileApplyingProject` has been added to `Grouping` which is true whenever we make changes to the original shape with optimization. Hence if a timeseries query has a grouping with this set to true, we set `skipEmptyBuckets=true` in the query context (i.e. donot return any row).
   <hr>
   
   ##### Key changed/added classes in this PR
    * `Grouping.java`
    * `DruidQuery#toTimeseriesQuery`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007631973


   Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007627868


   Sure thing @jihoonson 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007542691


   Merged since test failures are unrelated. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007604300






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 merged pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 merged pull request #12065:
URL: https://github.com/apache/druid/pull/12065


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 merged pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 merged pull request #12065:
URL: https://github.com/apache/druid/pull/12065


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r777366385



##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/rel/Grouping.java
##########
@@ -70,12 +71,25 @@ private Grouping(
       @Nullable final DimFilter havingFilter,
       final RowSignature outputRowSignature
   )
+  {
+    this(dimensions, subtotals, aggregations, havingFilter, outputRowSignature, false);
+  }
+
+  private Grouping(
+      final List<DimensionExpression> dimensions,
+      final Subtotals subtotals,
+      final List<Aggregation> aggregations,
+      @Nullable final DimFilter havingFilter,
+      final RowSignature outputRowSignature,
+      final boolean droppedDimensionsWhileApplyingProject
+  )
   {
     this.dimensions = ImmutableList.copyOf(dimensions);
     this.subtotals = Preconditions.checkNotNull(subtotals, "subtotals");
     this.aggregations = ImmutableList.copyOf(aggregations);
     this.havingFilter = havingFilter;
     this.outputRowSignature = Preconditions.checkNotNull(outputRowSignature, "outputRowSignature");
+    this.droppedDimensionsWhileApplyingProject = droppedDimensionsWhileApplyingProject;

Review comment:
       can we rename it to something else e.g. hadGroupingDimensions or groupDimensionsDropped or something? basically, we need not add `whileApplyingProject` here

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception

Review comment:
       can you add few more queries such as one that groups on a dummy dimension

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
+  {
+    skipVectorize();
+    testQuery(
+        "SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",

Review comment:
       btw I noticed that it runs as a group-by query if we remove the `m1 = 50` clause. do you know why? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007542691






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] LakshSingla commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
LakshSingla commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r777401695



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
+  {
+    skipVectorize();
+    testQuery(
+        "SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",

Review comment:
       On removing the `m1=50` clause, the `dim1` is not getting reduced to `wat` literal in the Calcite planner phase, so the optimization in `Grouping.java` to eliminate the literals is not getting applied. 
   I checked the place where this reduction is happening, and it's in the `ProjectMergeRule` of Calcite. When there's a single project, then `ProjectMergeRule` is not getting invoked, which is causing this reduction to literal. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] LakshSingla commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
LakshSingla commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r780051096



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception

Review comment:
       Added a few more test cases, which show the consequences of optimization in different cases. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] LakshSingla commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
LakshSingla commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r777401695



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
+  {
+    skipVectorize();
+    testQuery(
+        "SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",

Review comment:
       When removing the `m1=50` clause, the `dim1` is not getting reduced to `wat` literal in the Calcite planner phase, so the optimization in `Grouping.java` to eliminate the literals is not getting applied. 
   I checked the place where this reduction is happening, and it's in the `ProjectMergeRule` of Calcite. When there's a single project, then `ProjectMergeRule` is not getting invoked, which is causing this reduction to literal. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #12065:
URL: https://github.com/apache/druid/pull/12065#issuecomment-1007604300


   @abhishekagarwal87 I feel our test is too flaky recently. Can you file an issue for flaky tests in this PR, so that we can promote and fix it? If we have one already, please link it to here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] LakshSingla commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
LakshSingla commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r777401695



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
+  {
+    skipVectorize();
+    testQuery(
+        "SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",

Review comment:
       On removing the `m1=50` clause, the `dim1` is not getting reduced to `wat` literal in the Calcite planner phase, so the optimization in `Grouping.java` to eliminate the literals is not getting applied. 
   I checked the place where this reduction is happening, and it's in the `ProjectMergeRule` of Calcite. When there's a single project, then `ProjectMergeRule` is not getting invoked. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] LakshSingla commented on a change in pull request #12065: Return empty result when a group by gets optimized to a timeseries query

Posted by GitBox <gi...@apache.org>.
LakshSingla commented on a change in pull request #12065:
URL: https://github.com/apache/druid/pull/12065#discussion_r780051096



##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -13342,4 +13340,34 @@ public void testCommonVirtualExpressionWithDifferentValueType() throws Exception
         ImmutableList.of()
     );
   }
+
+  // When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
+  // want it to return empty bucket if no row matches
+  @Test
+  public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception

Review comment:
       Added a few more test cases, which show the consequences of optimization in different cases. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org