You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "gianm (via GitHub)" <gi...@apache.org> on 2023/05/10 17:49:05 UTC

[GitHub] [druid] gianm commented on a diff in pull request #14237: Updates to filter processing for inner query in Joins

gianm commented on code in PR #14237:
URL: https://github.com/apache/druid/pull/14237#discussion_r1190215126


##########
processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinable.java:
##########
@@ -105,7 +106,7 @@ public ColumnValuesWithUniqueFlag getNonNullColumnValues(String columnName, fina
     try (final IndexedTable.Reader reader = table.columnReader(columnPosition)) {
       // Sorted set to encourage "in" filters that result from this method to do dictionary lookups in order.
       // The hopes are that this will improve locality and therefore improve performance.
-      final Set<String> allValues = createValuesSet();
+      final Set<String> allValues = createOrderedValuesSet();

Review Comment:
   FYI, InDimFilter wants a sorted set internally, so if you give it a non-sorted set, it'll need to copy it, which slows things down. So we really do want to keep this being a sorted set.
   
   Also, the comment is inaccurate, since InDimFilter always uses sorted sets internally, since #12517. (See ValuesSet.) Could you update it? Something like "Use a SortedSet so InDimFilter doesn't need to create its own".



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -178,15 +183,37 @@ static JoinClauseToFilterConversion convertJoinToFilter(
         }
 
         numValues -= columnValuesWithUniqueFlag.getColumnValues().size();
-        filters.add(Filters.toFilter(new InDimFilter(leftColumn, columnValuesWithUniqueFlag.getColumnValues())));
+        // For each column value which are received in order we increment the index
+        // and add it in the appropriate index in the map
+        int c = 0;
+        for (String val : columnValuesWithUniqueFlag.getColumnValues()) {

Review Comment:
   The logic here is inefficient since it creates `filterMap` full of `SelectorDimFilter` even if it ends up not being used (in the case where `clause.getCondition().getEquiConditions().size() == 1`). We need to be careful about this stuff, since the filters can be quite large.
   
   Also I am not sure if it's correct. It looks like this is doing a zip between all `clause.getJoinable().getNonNullColumnValues(condition.getRightColumn(), numValues)` for each equi-join condition. However, because `getNonNullColumnValues` returns _unique_ values, the different column values might not line up. I think if you add a test where the column values aren't all unique, you'll see this effect.
   
   IMO, it's OK to fix this bug by limiting the join-to-filter optimization to the case where there is a single equi-join. It's not clear that the optimization is actually helpful when there's more than one equi-join, since we have to use a bunch of selectors with a bunch of boolean operations, rather than an "in" filter. So, I'd suggest doing that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org