You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/06/30 21:49:45 UTC
[GitHub] [druid] jihoonson commented on a change in pull request #11213: add single input string expression dimension vector selector and better expression planning

jihoonson commented on a change in pull request #11213:
URL: https://github.com/apache/druid/pull/11213#discussion_r661131893



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java
##########
@@ -117,4 +117,25 @@ public GroupByVectorColumnSelector makeObjectProcessor(
     }
     return NilGroupByVectorColumnSelector.INSTANCE;
   }
+
+  /**
+   * The group by engine vector processor has a more relaxed approach to choosing to use a dictionary encoded string
+   * selector over an object selector than some of the other {@link VectorColumnProcessorFactory} implementations.
+   *
+   * Basically, if a valid dictionary exists, we will use it to group on dictionary ids (so that we can use
+   * {@link SingleValueStringGroupByVectorColumnSelector} whenever possible instead of
+   * {@link DictionaryBuildingSingleValueStringGroupByVectorColumnSelector}).
+   *
+   * We do this even for things like virtual columns that have a single string input, because it allows deferring
+   * accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting
+   * them to string form (if not already cached), and in the case of expressions, computing the expression output for
+   * the the string input.
+   */
+  @Override
+  public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)

Review comment:
       Please add `@Nullable` for `capabilities`.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java
##########
@@ -117,4 +117,25 @@ public GroupByVectorColumnSelector makeObjectProcessor(
     }
     return NilGroupByVectorColumnSelector.INSTANCE;
   }
+
+  /**
+   * The group by engine vector processor has a more relaxed approach to choosing to use a dictionary encoded string
+   * selector over an object selector than some of the other {@link VectorColumnProcessorFactory} implementations.
+   *
+   * Basically, if a valid dictionary exists, we will use it to group on dictionary ids (so that we can use
+   * {@link SingleValueStringGroupByVectorColumnSelector} whenever possible instead of
+   * {@link DictionaryBuildingSingleValueStringGroupByVectorColumnSelector}).
+   *
+   * We do this even for things like virtual columns that have a single string input, because it allows deferring
+   * accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting
+   * them to string form (if not already cached), and in the case of expressions, computing the expression output for
+   * the the string input.
+   */
+  @Override
+  public boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)
+  {
+    Preconditions.checkArgument(capabilities != null, "Capabilities must not be null");
+    Preconditions.checkArgument(capabilities.getType() == ValueType.STRING, "Must only be called on a STRING column");
+    return capabilities.isDictionaryEncoded().isTrue();

Review comment:
       Just for recording (and helping myself to remember later in the future), this means that the groupBy vector engine will use the dictionary IDs to compute per-segment results, and decode them when merging those results, which is what non-vectorized engine does today. When the column is dictionary encoded but not unique, this optimization might not be always good because there could be some sort of tradeoff depending on the column cardinality post expression evaluation. Even though I think this optimization is likely good in most cases, it could worth investigating further later to understand the tradeoff better.

##########
File path: processing/src/main/java/org/apache/druid/segment/virtual/ExpressionVectorSelectors.java
##########
@@ -54,7 +55,13 @@ public static SingleValueDimensionVectorSelector makeSingleValueDimensionVectorS
       String constant = plan.getExpression().eval(ExprUtils.nilBindings()).asString();
       return ConstantVectorSelectors.singleValueDimensionVectorSelector(factory.getReadableVectorInspector(), constant);
     }
-    throw new IllegalStateException("Only constant expressions currently support dimension selectors");
+    if (plan.is(ExpressionPlan.Trait.SINGLE_INPUT_SCALAR) && ExprType.STRING == plan.getOutputType()) {

Review comment:
       Can we also use `SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector` when the plan is `SINGLE_INPUT_MAPPABLE`?

##########
File path: processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java
##########
@@ -122,61 +144,184 @@ public Expr getAppliedExpression()
     return expression;
   }
 
+  /**
+   * If an expression uses a multi-valued input in a scalar manner, and the expression contains an accumulator such as
+   * for use as part of an aggregator, the expression can be automatically transformed to fold the accumulator across
+   * the values of the original expression.
+   *
+   * @see Parser#foldUnappliedBindings(Expr, Expr.BindingAnalysis, List, String)
+   */
   public Expr getAppliedFoldExpression(String accumulatorId)
   {
     if (is(Trait.NEEDS_APPLIED)) {
+      Preconditions.checkState(
+          !unappliedInputs.contains(accumulatorId),
+          "Accumulator cannot be implicitly transformed, if it is an ARRAY or multi-valued type it must"
+          + " be used explicitly as such"
+      );
       return Parser.foldUnappliedBindings(expression, analysis, unappliedInputs, accumulatorId);
     }
     return expression;
   }
 
-  public Expr.BindingAnalysis getAnalysis()
-  {
-    return analysis;
-  }
-
-  public boolean is(Trait... flags)
-  {
-    return is(traits, flags);
-  }
-
-  public boolean any(Trait... flags)
-  {
-    return any(traits, flags);
-  }
-
+  /**
+   * The output type of the original expression.
+   *
+   * Note that this might not be the true for the expressions provided by {@link #getAppliedExpression()}
+   * or {@link #getAppliedFoldExpression(String)}, should the expression have any unapplied inputs
+   */
   @Nullable
   public ExprType getOutputType()
   {
     return outputType;
   }
 
+  /**
+   * If and only if the column has a single input, get the {@link ValueType} of that input
+   */
   @Nullable
   public ValueType getSingleInputType()
   {
     return singleInputType;
   }
 
+  /**
+   * If and only if the expression has a single input, get the name of that input
+   */
   public String getSingleInputName()
   {
     return Iterables.getOnlyElement(analysis.getRequiredBindings());
   }
 
+  /**
+   * Get set of inputs which were completely missing information, possibly a non-existent column or from a column
+   * selector factory with incomplete information
+   */
   public Set<String> getUnknownInputs()
   {
     return unknownInputs;
   }
 
+  /**
+   * Returns basic analysis of the inputs to an {@link Expr} and how they are used
+   *
+   * @see Expr.BindingAnalysis
+   */
+  public Expr.BindingAnalysis getAnalysis()
+  {
+    return analysis;
+  }
+
+  /**
+   * Tries to construct the most appropriate {@link ColumnCapabilities} for this plan given the {@link #outputType} and
+   * {@link #traits} inferred by the {@link ExpressionPlanner}, optionally with the help of hint {@link ValueType}.
+   *
+   * If no output type was able to be inferred during planning, returns null
+   */
+  @Nullable
+  public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)
+  {
+    if (outputType != null) {
+      final ValueType inferredValueType = ExprType.toValueType(outputType);
+
+      if (inferredValueType.isNumeric()) {
+        // if float was explicitly specified preserve it, because it will currently never be the computed output type
+        // since there is no float expression type
+        if (ValueType.FLOAT == hint) {
+          return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.FLOAT);
+        }
+        return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(inferredValueType);
+      }
+
+      // null constants can sometimes trip up the type inference to report STRING, so check if explicitly supplied
+      // output type is numeric and stick with that if so
+      if (hint != null && hint.isNumeric()) {
+        return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(hint);
+      }
+
+      // fancy string stuffs
+      if (ValueType.STRING == inferredValueType) {
+        // constant strings are supported as dimension selectors, set them as dictionary encoded and unique for all the
+        // bells and whistles the engines have to offer
+        if (isConstant()) {
+          return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities()
+                                       .setDictionaryEncoded(true)
+                                       .setDictionaryValuesUnique(true)
+                                       .setDictionaryValuesSorted(true)
+                                       .setHasNulls(expression.isNullLiteral());
+        }
+
+        // single input strings also have an optimization which allow defering evaluation time until dictionary encoded
+        // column lookup, so if the underlying column is a dictionary encoded string then we can report as such
+        if (any(Trait.SINGLE_INPUT_SCALAR, Trait.SINGLE_INPUT_MAPPABLE)) {
+          ColumnCapabilities underlyingCapabilities = baseInputInspector.getColumnCapabilities(getSingleInputName());
+          if (underlyingCapabilities != null) {
+            // since we don't know if the expression is 1:1 or if it retains ordering we can only piggy back only
+            // report as dictionary encoded, but it still allows us to use algorithms which work with dictionaryIds
+            // to create a dictionary encoded selector instead of an object selector to defer expression evaluation
+            // until query time
+            return ColumnCapabilitiesImpl.copyOf(underlyingCapabilities)
+                                         .setType(ValueType.STRING)
+                                         .setDictionaryValuesSorted(false)
+                                         .setDictionaryValuesUnique(false)
+                                         .setHasNulls(true);
+          }
+        }
+      }
+
+      // we don't have to check for unknown input here because output type is unable to be inferred if we don't know
+      // the complete set of input types
+      if (any(Trait.NON_SCALAR_OUTPUT, Trait.NEEDS_APPLIED)) {
+        // if the hint requested a string, use a string
+        if (ValueType.STRING == hint) {
+          return ColumnCapabilitiesImpl.createSimpleArrayColumnCapabilities(ValueType.STRING);
+        }
+        // maybe something is looking for a little fun and wants arrays? let whatever it is through
+        return ColumnCapabilitiesImpl.createSimpleArrayColumnCapabilities(ExprType.toValueType(outputType));
+      }
+
+      // if we got here, lets call it single value string output, non-dictionary encoded
+      return ColumnCapabilitiesImpl.createSimpleSingleValueStringColumnCapabilities();
+    }
+    // we don't know what we don't know
+    return null;

Review comment:
       Can we still infer in some cases if a non-null `hint` is provided, such as when `outputType` is null but `hint` is `ValueType.DOUBLE`? I'm not sure when this case can happen though.

##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnProcessors.java
##########
@@ -219,12 +219,26 @@ private static ColumnCapabilities computeDimensionSpecCapabilities(
     } else if (dimensionSpec.getExtractionFn() != null) {
       // DimensionSpec is applying an extractionFn but *not* decorating. We have some insight into how the
       // extractionFn will behave, so let's use it.
+      final boolean dictionaryEncoded;
+      final boolean unique;
+      final boolean sorted;
+      if (columnCapabilities != null) {
+        dictionaryEncoded = columnCapabilities.isDictionaryEncoded().isTrue();
+        unique = columnCapabilities.areDictionaryValuesUnique().isTrue();
+        sorted = columnCapabilities.areDictionaryValuesSorted().isTrue();
+      } else {
+        dictionaryEncoded = false;
+        unique = false;
+        sorted = false;
+      }
 
       return new ColumnCapabilitiesImpl()
           .setType(ValueType.STRING)
-          .setDictionaryValuesSorted(dimensionSpec.getExtractionFn().preservesOrdering())
-          .setDictionaryValuesUnique(dimensionSpec.getExtractionFn().getExtractionType()
-                                     == ExtractionFn.ExtractionType.ONE_TO_ONE)
+          .setDictionaryEncoded(dictionaryEncoded)
+          .setDictionaryValuesSorted(sorted && dimensionSpec.getExtractionFn().preservesOrdering())
+          .setDictionaryValuesUnique(
+              unique && dimensionSpec.getExtractionFn().getExtractionType() == ExtractionFn.ExtractionType.ONE_TO_ONE
+          )
           .setHasMultipleValues(dimensionSpec.mustDecorate() || mayBeMultiValue(columnCapabilities));

Review comment:
       nit: `dimensionSpec.mustDecorate()` always returns false here.

##########
File path: processing/src/main/java/org/apache/druid/segment/virtual/SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector.java
##########
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.virtual;
+
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprType;
+import org.apache.druid.math.expr.vector.ExprVectorProcessor;
+import org.apache.druid.segment.DimensionDictionarySelector;
+import org.apache.druid.segment.IdLookup;
+import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+
+import javax.annotation.Nullable;
+
+/**
+ * A {@link SingleValueDimensionVectorSelector} decorator that directly exposes the underlying dictionary ids in
+ * {@link #getRowVector}, saving expression computation until {@link #lookupName} is called. This allows for
+ * performing operations like grouping on the native dictionary ids, and deferring expression evaluation until
+ * after, which can dramatically reduce the total number of evaluations.
+ *
+ * @see ExpressionVectorSelectors for details on how expression vector selectors are constructed.
+ *
+ * @see SingleStringInputDeferredEvaluationExpressionDimensionSelector for the non-vectorized version of this selector.
+ */
+public class SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector
+    implements SingleValueDimensionVectorSelector
+{
+  private final SingleValueDimensionVectorSelector selector;
+  private final ExprVectorProcessor<String[]> stringProcessor;
+  private final StringVectorInputBindings inputBinding;
+
+  public SingleStringInputDeferredEvaluationExpressionDimensionVectorSelector(
+      SingleValueDimensionVectorSelector selector,
+      Expr expression
+  )
+  {
+    // Verify selector has a working dictionary.
+    if (selector.getValueCardinality() == DimensionDictionarySelector.CARDINALITY_UNKNOWN
+        || !selector.nameLookupPossibleInAdvance()) {
+      throw new ISE(
+          "Selector of class[%s] does not have a dictionary, cannot use it.",
+          selector.getClass().getName()
+      );
+    }
+    this.selector = selector;
+    this.inputBinding = new StringVectorInputBindings();
+    this.stringProcessor = expression.buildVectorized(inputBinding);
+  }
+
+  @Override
+  public int getValueCardinality()
+  {
+    return CARDINALITY_UNKNOWN;
+  }
+
+  @Nullable
+  @Override
+  public String lookupName(int id)
+  {
+    inputBinding.currentValue[0] = selector.lookupName(id);
+    return stringProcessor.evalVector(inputBinding).values()[0];
+  }
+
+  @Override
+  public boolean nameLookupPossibleInAdvance()
+  {
+    return true;
+  }
+
+  @Nullable
+  @Override
+  public IdLookup idLookup()
+  {
+    return null;
+  }
+
+  @Override
+  public int[] getRowVector()
+  {
+    return selector.getRowVector();
+  }
+
+  @Override
+  public int getMaxVectorSize()
+  {
+    return selector.getMaxVectorSize();
+  }
+
+  @Override
+  public int getCurrentVectorSize()
+  {
+    return selector.getCurrentVectorSize();
+  }
+
+  private static final class StringVectorInputBindings implements Expr.VectorInputBinding
+  {
+    private final String[] currentValue = new String[1];
+
+    @Nullable
+    @Override
+    public ExprType getType(String name)
+    {
+      return ExprType.STRING;
+    }
+
+    @Override
+    public int getMaxVectorSize()
+    {
+      return 1;

Review comment:
       It would worth mentioning why the vector size is 1. Can you add some comment about it?

##########
File path: processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java
##########
@@ -122,61 +144,184 @@ public Expr getAppliedExpression()
     return expression;
   }
 
+  /**
+   * If an expression uses a multi-valued input in a scalar manner, and the expression contains an accumulator such as
+   * for use as part of an aggregator, the expression can be automatically transformed to fold the accumulator across
+   * the values of the original expression.
+   *
+   * @see Parser#foldUnappliedBindings(Expr, Expr.BindingAnalysis, List, String)
+   */
   public Expr getAppliedFoldExpression(String accumulatorId)
   {
     if (is(Trait.NEEDS_APPLIED)) {
+      Preconditions.checkState(
+          !unappliedInputs.contains(accumulatorId),
+          "Accumulator cannot be implicitly transformed, if it is an ARRAY or multi-valued type it must"
+          + " be used explicitly as such"
+      );
       return Parser.foldUnappliedBindings(expression, analysis, unappliedInputs, accumulatorId);
     }
     return expression;
   }
 
-  public Expr.BindingAnalysis getAnalysis()
-  {
-    return analysis;
-  }
-
-  public boolean is(Trait... flags)
-  {
-    return is(traits, flags);
-  }
-
-  public boolean any(Trait... flags)
-  {
-    return any(traits, flags);
-  }
-
+  /**
+   * The output type of the original expression.
+   *
+   * Note that this might not be the true for the expressions provided by {@link #getAppliedExpression()}
+   * or {@link #getAppliedFoldExpression(String)}, should the expression have any unapplied inputs
+   */
   @Nullable
   public ExprType getOutputType()
   {
     return outputType;
   }
 
+  /**
+   * If and only if the column has a single input, get the {@link ValueType} of that input
+   */
   @Nullable
   public ValueType getSingleInputType()
   {
     return singleInputType;
   }
 
+  /**
+   * If and only if the expression has a single input, get the name of that input
+   */
   public String getSingleInputName()
   {
     return Iterables.getOnlyElement(analysis.getRequiredBindings());
   }
 
+  /**
+   * Get set of inputs which were completely missing information, possibly a non-existent column or from a column
+   * selector factory with incomplete information
+   */
   public Set<String> getUnknownInputs()
   {
     return unknownInputs;
   }
 
+  /**
+   * Returns basic analysis of the inputs to an {@link Expr} and how they are used
+   *
+   * @see Expr.BindingAnalysis
+   */
+  public Expr.BindingAnalysis getAnalysis()
+  {
+    return analysis;
+  }
+
+  /**
+   * Tries to construct the most appropriate {@link ColumnCapabilities} for this plan given the {@link #outputType} and
+   * {@link #traits} inferred by the {@link ExpressionPlanner}, optionally with the help of hint {@link ValueType}.
+   *
+   * If no output type was able to be inferred during planning, returns null
+   */
+  @Nullable
+  public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType hint)

Review comment:
       ```suggestion
     public ColumnCapabilities inferColumnCapabilities(@Nullable ValueType outputTypeHint)
   ```

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java
##########
@@ -117,4 +117,25 @@ public GroupByVectorColumnSelector makeObjectProcessor(
     }
     return NilGroupByVectorColumnSelector.INSTANCE;
   }
+
+  /**
+   * The group by engine vector processor has a more relaxed approach to choosing to use a dictionary encoded string
+   * selector over an object selector than some of the other {@link VectorColumnProcessorFactory} implementations.
+   *
+   * Basically, if a valid dictionary exists, we will use it to group on dictionary ids (so that we can use
+   * {@link SingleValueStringGroupByVectorColumnSelector} whenever possible instead of
+   * {@link DictionaryBuildingSingleValueStringGroupByVectorColumnSelector}).
+   *
+   * We do this even for things like virtual columns that have a single string input, because it allows deferring
+   * accessing any of the actual string values, which involves at minimum reading utf8 byte values and converting
+   * them to string form (if not already cached), and in the case of expressions, computing the expression output for
+   * the the string input.

Review comment:
       ```suggestion
      * the string input.
   ```

##########
File path: processing/src/main/java/org/apache/druid/segment/virtual/VirtualizedColumnInspector.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.virtual;
+
+import org.apache.druid.segment.ColumnInspector;
+import org.apache.druid.segment.VirtualColumns;
+import org.apache.druid.segment.column.ColumnCapabilities;
+
+import javax.annotation.Nullable;
+
+/**
+ * Provides {@link ColumnCapabilities} for both virtual and non-virtual columns by building on top of another base
+ * {@link ColumnInspector}.
+ *
+ * {@link VirtualColumns} are provided with the base inspector so that they may potentially infer output types to
+ * construct the appropriate capabilities for virtual columns, while the base inspector directly supplies the
+ * capabilities for non-virtual columns.
+ */
+public class VirtualizedColumnInspector implements ColumnInspector

Review comment:
       :+1: 

##########
File path: processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java
##########
@@ -86,4 +88,27 @@ T makeMultiValueDimensionProcessor(
    * cases where the dictionary does not exist or is not expected to be useful.
    */
   T makeObjectProcessor(@SuppressWarnings("unused") ColumnCapabilities capabilities, VectorObjectSelector selector);
+
+  /**
+   * The processor factory can influence the decision on whether or not to prefer a dictionary encoded column value
+   * selector over a an object selector by examining the {@link ColumnCapabilities}.
+   *
+   * By default, all processor factories prefer to use a dictionary encoded selector if the column has a dictionary
+   * available ({@link ColumnCapabilities#isDictionaryEncoded()} is true), and there is a unique mapping of dictionary
+   * id to value ({@link ColumnCapabilities#areDictionaryValuesUnique()} is true), but this can be overridden
+   * if there is more appropriate behavior for a given processor.
+   *
+   * For processors, this means by default only actual dictionary encoded string columns (likely from real segments)
+   * will use {@link SingleValueDimensionVectorSelector} and {@link MultiValueDimensionVectorSelector}, while
+   * processors on things like string expression virtual columns will prefer to use {@link VectorObjectSelector}. In
+   * other words, it is geared towards use cases where there is a clear opportunity to benefit to deferring having to
+   * deal with the actual string value in exchange for the increased complexity of dealing with dictionary encoded
+   * selectors.
+   */
+  default boolean useDictionaryEncodedSelector(ColumnCapabilities capabilities)

Review comment:
       Please add `@Nullable` for `capabilities`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org