Posted to commits@druid.apache.org by "imply-cheddar (via GitHub)" <gi...@apache.org> on 2023/03/01 01:52:25 UTC

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13799: Now unnest allows bound, in and selector filters on the unnested column

imply-cheddar commented on code in PR #13799:
URL: https://github.com/apache/druid/pull/13799#discussion_r1121015434


##########
processing/src/main/java/org/apache/druid/query/UnnestDataSource.java:
##########
@@ -50,30 +48,26 @@ public class UnnestDataSource implements DataSource
   private final DataSource base;
   private final String column;
   private final String outputName;
-  private final LinkedHashSet<String> allowList;
 
   private UnnestDataSource(
       DataSource dataSource,
       String columnName,
-      String outputName,
-      LinkedHashSet<String> allowList
+      String outputName
   )

Review Comment:
   I had expected you to change the allowList to a Filter but keep it on this object.  It looks like you are assuming that the filter passed into the call to make the cursor is good enough.  In the "best" case, the planner is attaching the filter on the unnested column to the other filters being included, in which case, as Clint mentions, the entire filter moves to a value matcher with absolutely nothing being pushed down.  In the worst case, some of the filters are being lost in planning and incorrect results can occur.  I haven't validated which of these is actually occurring yet.
   
   Given that Calcite very clearly puts the filter on the UNCOLLECT, you should be able to just attach a Filter object here instead of putting it on the query as a whole.  The logic inside of this should remain more or less the same as it was before with the allow list, except that the allowed set comes from applying a value predicate to the dictionary rather than from an explicit list of values.
   
   The filter used to make the cursor should continue to be passed down (and the filter attached here should also be adjusted and pushed down along with it).
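   
   To make the shape concrete, something roughly like this is what I have in mind (a sketch only; the `unnestFilter` field name, and whether it ends up being a `DimFilter` or a `Filter` at this layer, are illustrative rather than the actual API):
   
   ```java
   public class UnnestDataSource implements DataSource
   {
     private final DataSource base;
     private final String column;
     private final String outputName;
     // Carries the filter on the unnested column with the data source itself,
     // taking the place of the old LinkedHashSet<String> allowList.
     private final DimFilter unnestFilter;
   
     private UnnestDataSource(
         DataSource dataSource,
         String columnName,
         String outputName,
         DimFilter unnestFilter
     )
     {
       this.base = dataSource;
       this.column = columnName;
       this.outputName = outputName;
       this.unnestFilter = unnestFilter;
     }
   }
   ```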



##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteArraysQueryTest.java:
##########
@@ -3341,4 +3328,467 @@ public void testUnnestWithConstant()
         )
     );
   }
+
+  @Test
+  public void testUnnestWithInFilterOnUnnestedCol()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3 IN ('a','b') ",
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new TableDataSource(CalciteTests.DATASOURCE3),
+                      "dim3",
+                      "EXPR$0"
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .filters(new InDimFilter("EXPR$0", ImmutableList.of("a", "b"), null))
+                  .columns(ImmutableList.of(
+                      "EXPR$0"
+                  ))
+                  .build()
+        ),
+
+        ImmutableList.of(
+            new Object[]{"a"},
+            new Object[]{"b"},
+            new Object[]{"b"}
+        )
+    );
+  }
+
+  @Test
+  public void testUnnestWithInFilterOnUnnestedColWhereFilterIsNotOnFirstValue()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3 IN ('d','c') ",
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new TableDataSource(CalciteTests.DATASOURCE3),
+                      "dim3",
+                      "EXPR$0"
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .filters(new InDimFilter("EXPR$0", ImmutableList.of("d", "c"), null))
+                  .columns(ImmutableList.of(
+                      "EXPR$0"
+                  ))
+                  .build()
+        ),
+
+        ImmutableList.of(
+            new Object[]{"c"},
+            new Object[]{"d"}
+        )
+    );
+  }
+
+  @Test
+  public void testUnnestWithInFilterOnUnnestedColWhereValuesDoNotExist()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3 IN ('foo','bar') ",
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new TableDataSource(CalciteTests.DATASOURCE3),
+                      "dim3",
+                      "EXPR$0"
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .filters(new InDimFilter("EXPR$0", ImmutableList.of("foo", "bar"), null))
+                  .columns(ImmutableList.of(
+                      "EXPR$0"
+                  ))
+                  .build()
+        ),
+        ImmutableList.of()
+    );
+  }
+
+  @Test
+  public void testUnnestWithBoundFilterOnUnnestedCol()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where (d3>= 'b' AND d3 < 'd') ",
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new TableDataSource(CalciteTests.DATASOURCE3),
+                      "dim3",
+                      "EXPR$0"
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .filters(bound("EXPR$0", "b", "d", false, true, null, StringComparators.LEXICOGRAPHIC))
+                  .columns(ImmutableList.of(
+                      "EXPR$0"
+                  ))
+                  .build()
+        ),
+
+        ImmutableList.of(
+            new Object[]{"b"},
+            new Object[]{"b"},
+            new Object[]{"c"}
+        )
+    );
+  }
+
+
+  @Test
+  public void testUnnestWithFilteringOnUnnestedVirtualCol()
+  {
+    skipVectorize();
+    cannotVectorize();
+    testQuery(
+        "SELECT d12 FROM druid.numfoo, UNNEST(ARRAY[m1, m2]) as unnested (d12) where d12 IN ('1','2') AND m1 < 10",
+        ImmutableList.of(
+            Druids.newScanQueryBuilder()
+                  .dataSource(UnnestDataSource.create(
+                      new QueryDataSource(
+                          newScanQueryBuilder()
+                              .dataSource(
+                                  new TableDataSource(CalciteTests.DATASOURCE3)
+                              )
+                              .intervals(querySegmentSpec(Filtration.eternity()))
+                              .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                              .legacy(false)
+                              .filters(bound("m1", null, "10", false, true, null, StringComparators.NUMERIC))
+                              .columns(
+                                  "__time",
+                                  "cnt",
+                                  "d1",
+                                  "d2",
+                                  "dim1",
+                                  "dim2",
+                                  "dim3",
+                                  "dim4",
+                                  "dim5",
+                                  "dim6",
+                                  "f1",
+                                  "f2",
+                                  "l1",
+                                  "l2",
+                                  "m1",
+                                  "m2",
+                                  "unique_dim1"
+                              )
+                              .context(QUERY_CONTEXT_DEFAULT)
+                              .build()
+                      ),
+                      "v0",
+                      "EXPR$0"
+                  ))
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
+                  .legacy(false)
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .virtualColumns(expressionVirtualColumn("v0", "array(\"m1\",\"m2\")", ColumnType.FLOAT_ARRAY))
+                  .filters(new InDimFilter("EXPR$0", ImmutableList.of("1.0", "2.0"), null))
+                  .columns(ImmutableList.of(
+                      "EXPR$0"
+                  ))
+                  .build()
+        ),
+
+        ImmutableList.of(
+            new Object[]{1.0f},
+            new Object[]{1.0f},
+            new Object[]{2.0f},
+            new Object[]{2.0f}
+        )
+    );
+  }

Review Comment:
   So, from looking at how things are planned, I thought that we were throwing away the actual where clause.  Instead, this test showed me what is actually happening: when we have both filters, the unnest is being planned on top of a scan query which carries the original filter.
   
   This is a bad plan: it needlessly forces things to run on the Broker instead of being pushed down.  The unnest should be pushed down onto the actual TableDataSource, and the native query here should be a single scan on top of an unnest data source over a table reference.
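   
   For concreteness, the plan shape I'd expect for this test is roughly the following single scan (a sketch only, reusing the test helpers from this file; the `and(...)` usage, the placement of the virtual column, and the exact filter push-down are assumptions, not the precise expected query):
   
   ```java
   Druids.newScanQueryBuilder()
         .dataSource(UnnestDataSource.create(
             new TableDataSource(CalciteTests.DATASOURCE3),
             "v0",
             "EXPR$0"
         ))
         .virtualColumns(expressionVirtualColumn("v0", "array(\"m1\",\"m2\")", ColumnType.FLOAT_ARRAY))
         .intervals(querySegmentSpec(Filtration.eternity()))
         .filters(and(
             new InDimFilter("EXPR$0", ImmutableList.of("1.0", "2.0"), null),
             bound("m1", null, "10", false, true, null, StringComparators.NUMERIC)
         ))
         .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
         .legacy(false)
         .context(QUERY_CONTEXT_DEFAULT)
         .columns(ImmutableList.of("EXPR$0"))
         .build()
   ```
   
   In other words, no QueryDataSource wrapping an inner scan at all.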



##########
processing/src/main/java/org/apache/druid/segment/UnnestDimensionCursor.java:
##########
@@ -203,8 +202,10 @@ public Class<?> classOfObject()
           @Override
           public int getValueCardinality()
           {
-            if (!allowedBitSet.isEmpty()) {
-              return allowedBitSet.cardinality();
+            if (allowFilter instanceof InDimFilter) {
+              return ((InDimFilter) allowFilter).getValues().size();
+            } else if (allowFilter instanceof AndFilter) {
+              return ((AndFilter) allowFilter).getFilters().size();

Review Comment:
   We cannot eliminate the bitset here.  You need to produce the bitset by applying the value matcher to the dictionary of values.
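   
   Roughly along these lines (an illustrative sketch; the plain `java.util.function.Predicate` is stand-in glue for whatever predicate comes off the attached filter, while `getValueCardinality()` and `lookupName()` are the existing selector methods):
   
   ```java
   // Build the allowed-ID bitset by running the filter's value predicate over the
   // dictionary, rather than special-casing InDimFilter / AndFilter.
   final BitSet allowedBitSet = new BitSet();
   final Predicate<String> valuePredicate = "b"::equals;  // e.g.; really derived from the attached filter
   for (int id = 0; id < baseSelector.getValueCardinality(); id++) {
     if (valuePredicate.test(baseSelector.lookupName(id))) {
       allowedBitSet.set(id);
     }
   }
   // getValueCardinality() can then keep returning allowedBitSet.cardinality()
   // when the bitset is non-empty, same as before.
   ```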



##########
processing/src/main/java/org/apache/druid/segment/UnnestDimensionCursor.java:
##########
@@ -351,43 +361,25 @@ private void initialize()
   private void advanceAndUpdate()
   {
     if (indexedIntsForCurrentRow == null) {
-      index = 0;
+      indexForRow = 0;
       if (!baseCursor.isDone()) {
         baseCursor.advanceUninterruptibly();
       }
     } else {
-      if (index >= indexedIntsForCurrentRow.size() - 1) {
+      if (indexForRow >= indexedIntsForCurrentRow.size() - 1) {

Review Comment:
   `IndexedInts` is an interface in the Druid code base.  It's logically equivalent to an `int[]` whose values are references into the dictionary.  There is some javadoc that attempts to explain these things on interfaces like `DimensionSelector` and `DimensionDictionarySelector`.
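   
   For example (illustrative only; `dimensionSelector` here is just a stand-in for the base column's selector):
   
   ```java
   // Each element of the IndexedInts is a dictionary ID for the current row;
   // lookupName() on the selector resolves it to the actual value.
   final IndexedInts row = dimensionSelector.getRow();            // e.g. logically [0, 2]
   for (int i = 0; i < row.size(); i++) {
     final int dictId = row.get(i);
     final String value = dimensionSelector.lookupName(dictId);   // e.g. "a", "c"
   }
   ```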




