You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/29 07:11:27 UTC

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13085: Refactoring the data source before unnest

imply-cheddar commented on code in PR #13085:
URL: https://github.com/apache/druid/pull/13085#discussion_r983131194


##########
processing/src/main/java/org/apache/druid/query/Queries.java:
##########
@@ -191,11 +191,19 @@ public static <T> Query<T> withBaseDataSource(final Query<T> query, final DataSo
   {
     final Query<T> retVal;
 
-    if (query.getDataSource() instanceof QueryDataSource) {
-      final Query<?> subQuery = ((QueryDataSource) query.getDataSource()).getQuery();
+    /*
+     * Currently, this method is implemented in terms of a static walk doing a bunch of instanceof checks.
+     * We should likely look into moving this functionality into the DataSource object itself so that they
+     * can walk and create new objects on their own.  This will be necessary as we expand the set of DataSources
+     * that do actual work, as each of them will need to show up in this if/then waterfall.
+     */

Review Comment:
   I'm pretty sure it should be safe to move something logically equivalent to this method into a method on `DataSource` itself.  Let's do that inside of this refactor, so that we leave this refactor with hopefully never needing to have random methods of `instanceof` checks again.



##########
processing/src/main/java/org/apache/druid/query/JoinDataSource.java:
##########
@@ -127,12 +155,22 @@ public static JoinDataSource create(
       final String rightPrefix,
       final JoinConditionAnalysis conditionAnalysis,
       final JoinType joinType,
-      final DimFilter leftFilter
+      final DimFilter leftFilter,
+      @Nullable @JacksonInject final JoinableFactoryWrapper joinableFactoryWrapper

Review Comment:
   Having `@JacksonInject` on a method that isn't used by Jackson can be confusing (it will make future developers try to figure out how Jackson uses this method).  I don't think it needs to be there.



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/BaseLeafFrameProcessor.java:
##########
@@ -152,13 +153,23 @@ private boolean initializeSegmentMapFn(final IntSet readableInputs)
       segmentMapFn = Function.identity();
       return true;
     } else {
-      final boolean retVal = broadcastJoinHelper.buildBroadcastTablesIncrementally(readableInputs);
-
-      if (retVal) {
-        segmentMapFn = broadcastJoinHelper.makeSegmentMapFn(query);
+      if (query.getDataSource() instanceof InputNumberDataSource) {
+        final boolean retVal = broadcastJoinHelper.buildBroadcastTablesIncrementally(readableInputs);
+        if (retVal) {
+          InputNumberDataSource inputNumberDataSource = (InputNumberDataSource) query.getDataSource();
+          // The InputNumberData source was going through the broadcastJoinHelper which
+          // was using the JoinableFactoryWrapper to create segment map function.
+          // After refactoring, the segment map function creation is moved to data source
+          // Hence for InputNumberDataSource we are setting the broadcast join helper for the data source
+          // and moving the segment map function creation there

Review Comment:
   Sorry to pick on a comment, but this comment does what most comments in code do: it talks from the context of the developer who is writing the code.  Any new developer who comes and reads the code for the first time doesn't even know that this code was refactored, let alone know why it was refactored or what the state of the code was before the refactor.  Instead of explain the change of the refactor, the comment should attempt to help explain what the current, post-refactor code is attempting to accomplish.  I.e. something like
   
   ```
   // The InputNumberDataSource requires a BroadcastJoinHelper to be able to create its
   // segment map function.  It would be a lot better if the InputNumberDataSource actually
   // had a way to get that injected into it on its own, but the relationship between these objects
   // was figured out during a refactor and using a setter here seemed like the least-bad way to
   // make progress on the refactor without breaking functionality.  Hopefully, some future 
   // developer will move this away from a setter.
   ```
   
   Or something like that.



##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/test/MSQTestBase.java:
##########
@@ -565,6 +572,8 @@ private ObjectMapper setupObjectMapper(Injector injector)
                                                                "compaction"
                                                            )
                                                        ).registerSubtypes(ExternalDataSource.class));
+    DruidSecondaryModule.setupJackson(injector, mapper);
+/*

Review Comment:
   Delete the code instead of comment it out please.



##########
server/src/main/java/org/apache/druid/server/LocalQuerySegmentWalker.java:
##########
@@ -94,12 +93,12 @@ public <T> QueryRunner<T> getQueryRunnerForIntervals(final Query<T> query, final
 
     final AtomicLong cpuAccumulator = new AtomicLong(0L);
 
-    final Function<SegmentReference, SegmentReference> segmentMapFn = joinableFactoryWrapper.createSegmentMapFn(
-        analysis.getJoinBaseTableFilter().map(Filters::toFilter).orElse(null),
-        analysis.getPreJoinableClauses(),
-        cpuAccumulator,
-        analysis.getBaseQuery().orElse(query)
-    );
+    Function<SegmentReference, SegmentReference> segmentMapFn = analysis.getDataSource()
+                                                                        .createSegmentMapFunction(
+                                                                            query,
+                                                                            cpuAccumulator
+                                                                        );
+

Review Comment:
   Now that there's a lot less arguments here, I'm pretty sure this formatting is no longer the best way to represent this code.  Please condense things.



##########
processing/src/main/java/org/apache/druid/query/InlineDataSource.java:
##########
@@ -102,6 +105,55 @@ public static InlineDataSource fromIterable(
     return new InlineDataSource(rows, signature);
   }
 
+  /**
+   * A very zealous equality checker for "rows" that respects deep equality of arrays, but nevertheless refrains
+   * from materializing things needlessly. Useful for unit tests that want to compare equality of different
+   * InlineDataSource instances.
+   */
+  private static boolean rowsEqual(final Iterable<Object[]> rowsA, final Iterable<Object[]> rowsB)

Review Comment:
   These private methods seem to have moved above some public methods, was it just your IDE doing stuff or was there a specific reason?  Generally speaking, the flow is always public methods first.



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -340,12 +192,60 @@ static JoinClauseToFilterConversion convertJoinToFilter(
     return new JoinClauseToFilterConversion(null, false);
   }
 
+  public JoinableFactory getJoinableFactory()
+  {
+    return joinableFactory;
+  }
+
+  /**
+   * Compute a cache key prefix for a join data source. This includes the data sources that participate in the RHS of a
+   * join as well as any query specific constructs associated with join data source such as base table filter. This key prefix
+   * can be used in segment level cache or result level cache. The function can return following wrapped in an

Review Comment:
   A bunch of this code looks like it changed, but I think it just had methods re-ordered?  Was that intentional?



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -340,12 +192,60 @@ static JoinClauseToFilterConversion convertJoinToFilter(
     return new JoinClauseToFilterConversion(null, false);
   }
 
+  public JoinableFactory getJoinableFactory()
+  {
+    return joinableFactory;
+  }
+
+  /**
+   * Compute a cache key prefix for a join data source. This includes the data sources that participate in the RHS of a
+   * join as well as any query specific constructs associated with join data source such as base table filter. This key prefix
+   * can be used in segment level cache or result level cache. The function can return following wrapped in an
+   * Optional
+   * - Non-empty byte array - If there is join datasource involved and caching is possible. The result includes
+   * join condition expression, join type and cache key returned by joinable factory for each {@link PreJoinableClause}
+   * - NULL - There is a join but caching is not possible. It may happen if one of the participating datasource
+   * in the JOIN is not cacheable.
+   *
+   * @param dataSourceAnalysis for the join datasource
+   * @return the optional cache key to be used as part of query cache key
+   * @throws {@link IAE} if this operation is called on a non-join data source
+   */
+  public Optional<byte[]> computeJoinDataSourceCacheKey(

Review Comment:
   The existence of this method is another point where things seem a little suspect.  I think the `DataSource` object needs a `public byte[] getCacheKey()` method.  Most of the current implementations would just return `new byte[]{}`, but the join one should return this (or null if not cacheable).  The unnest implementation will need to override the method as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org