Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/12/23 02:45:39 UTC

[GitHub] [druid] clintropolis commented on a change in pull request #12078: Grouping on arrays as arrays

clintropolis commented on a change in pull request #12078:
URL: https://github.com/apache/druid/pull/12078#discussion_r773583393



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy

Review comment:
       this class only handles strings, so its name is perhaps incorrect, or it should be made more generic to handle other types of arrays

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy
+    implements GroupByColumnSelectorStrategy
+{
+  private static final int GROUP_BY_MISSING_VALUE = -1;
+
+
+  // contains string <-> id for each element of the multi-value grouping column
+  // e.g. if [a,b,c] is the column value, dictionaryToInt will contain { a <-> 1, b <-> 2, c <-> 3 }
+  private final BiMap<String, Integer> dictionaryToInt;
+
+  // stores each row as an integer array, where each int is the element's id in dictionaryToInt
+  // e.g. [a,b,c] would be converted to [1,2,3] and assigned the integer id 1.
+  // [1,2,3] <-> 1
+  private final BiMap<ComparableIntArray, Integer> intListToInt;
+
+  @Override
+  public int getGroupingKeySize()
+  {
+    return Integer.BYTES;
+  }
+
+  public ArrayGroupByColumnSelectorStrategy()
+  {
+    dictionaryToInt = HashBiMap.create();
+    intListToInt = HashBiMap.create();
+  }
+
+  @VisibleForTesting
+  ArrayGroupByColumnSelectorStrategy(
+      BiMap<String, Integer> dictionaryToInt,
+      BiMap<ComparableIntArray, Integer> intArrayToInt
+  )
+  {
+    this.dictionaryToInt = dictionaryToInt;
+    this.intListToInt = intArrayToInt;
+  }
+
+  @Override
+  public void processValueFromGroupingKey(
+      GroupByColumnSelectorPlus selectorPlus,
+      ByteBuffer key,
+      ResultRow resultRow,
+      int keyBufferPosition
+  )
+  {
+    final int id = key.getInt(keyBufferPosition);
+
+    // GROUP_BY_MISSING_VALUE is used to indicate empty rows
+    if (id != GROUP_BY_MISSING_VALUE) {
+      final int[] intRepresentation = intListToInt.inverse()
+                                                  .get(id).getDelegate();
+      final String[] stringRepresentation = new String[intRepresentation.length];
+      for (int i = 0; i < intRepresentation.length; i++) {
+        stringRepresentation[i] = dictionaryToInt.inverse().get(intRepresentation[i]);
+      }
+      resultRow.set(selectorPlus.getResultRowPosition(), ComparableStringArray.of(stringRepresentation));
+    } else {
+      resultRow.set(selectorPlus.getResultRowPosition(), ComparableStringArray.of(NullHandling.defaultStringValues()));

Review comment:
       why wouldn't we use null for null? Using an empty array here seems incorrect, and the comparators down the line seem to sort nulls first, so passing nulls through seems like it would be ok. Am I missing something?
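    For illustration, a null-for-null version could look something like this (a sketch only; it reuses the PR's own `ComparableIntArray`/`ComparableStringArray` types, but the helper method itself is hypothetical):

```java
// Minimal sketch (not the PR's code) of the null-for-null alternative,
// assuming downstream comparators sort nulls first.
@Nullable
static ComparableStringArray resultValueFor(
    int id,
    BiMap<ComparableIntArray, Integer> intListToInt,
    BiMap<String, Integer> dictionaryToInt
)
{
  if (id == -1) {  // GROUP_BY_MISSING_VALUE
    return null;   // null for null, instead of an empty ComparableStringArray
  }
  final int[] ints = intListToInt.inverse().get(id).getDelegate();
  final String[] strings = new String[ints.length];
  for (int i = 0; i < ints.length; i++) {
    strings[i] = dictionaryToInt.inverse().get(ints[i]);
  }
  return ComparableStringArray.of(strings);
}
```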

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/ResultRowDeserializer.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby;
+
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.core.JsonToken;
+import com.fasterxml.jackson.databind.DeserializationContext;
+import com.fasterxml.jackson.databind.JsonDeserializer;
+import org.apache.druid.data.input.Row;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.data.ComparableStringArray;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class ResultRowDeserializer extends JsonDeserializer<ResultRow>
+{
+  final List<ColumnType> types;
+  final GroupByQuery query;
+
+  public ResultRowDeserializer(final List<ColumnType> types, final GroupByQuery query)
+  {
+    this.types = types;
+    this.query = query;
+  }
+
+  public static ResultRowDeserializer fromQuery(
+      final GroupByQuery query
+  )
+  {
+    RowSignature rowSignature = query.getResultRowSignature();
+    final List<ColumnType> types = new ArrayList<>(rowSignature.size());
+
+    for (String name : rowSignature.getColumnNames()) {
+      final ColumnType type = rowSignature.getColumnType(name)
+                                          .orElseThrow(() -> new ISE("No type for column [%s]", name));
+
+      types.add(type);
+    }
+
+    return new ResultRowDeserializer(types, query);
+
+  }
+
+  @Override
+  public ResultRow deserialize(JsonParser jp, DeserializationContext ctxt) throws IOException
+  {
+    // Deserializer that can deserialize either array- or map-based rows.
+    if (jp.isExpectedStartObjectToken()) {
+      final Row row = jp.readValueAs(Row.class);
+      return ResultRow.fromLegacyRow(row, query);
+    } else if (jp.isExpectedStartArrayToken()) {
+      final Object[] retVal = new Object[types.size()];
+
+      for (int i = 0; i < types.size(); i++) {
+        final JsonToken token = jp.nextToken();
+        switch (types.get(i).getType()) {
+          case STRING:
+            if (token == JsonToken.VALUE_NULL) {
+              retVal[i] = null;
+            } else if (token == JsonToken.VALUE_STRING) {
+              retVal[i] = jp.getText();
+            } else {
+              throw ctxt.instantiationException(
+                  ResultRow.class,
+                  StringUtils.format("Unexpected token [%s] when reading string", token)
+              );
+            }
+            break;
+
+          case LONG:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : jp.getLongValue();
+            break;
+          case DOUBLE:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : jp.getDoubleValue();
+            break;
+          case FLOAT:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : jp.getFloatValue();
+            break;
+          case ARRAY:
+            if (types.get(i).equals(ColumnType.STRING_ARRAY)) {
+              final List<String> strings = new ArrayList<>();
+              while (jp.nextToken() != JsonToken.END_ARRAY) {
+                strings.add(jp.getText());
+              }
+              retVal[i] = ComparableStringArray.of(strings.toArray(new String[0]));
+              break;
+            }
+
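+          // note: array types other than STRING_ARRAY fall through to the default case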
+          default:
+            throw new ISE("Can't handle type [%s]", types.get(i).asTypeString());

Review comment:
       I'm a bit worried about this method. On the one hand it is nice because it preserves the types exactly, but... what about complex types? They often serialize as base64 strings, but could potentially register anything with the mapper. And what about other types of arrays, including arrays of arrays? Additionally, type is still allowed to be null if it is unknown afaik.
   
    I'm also worried that there might be performance implications here; this change should be measured, I think, if we really want to do it. I can't help but wonder if something down the line should just wrap arrays to perform comparisons lazily, only when comparisons are actually needed, instead of trying to do it here, and whether deserializing directly into `ComparableStringArray` will trip up other things down the line that expect to see a `List`.
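    To make the lazy-wrapping idea concrete, here is a rough sketch (the class name and shape are illustrative, not existing Druid API): keep deserialized arrays as plain `List`s and wrap only at comparison time.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a comparable view over a plain List, built only when a
// comparison is actually needed, rather than eagerly deserializing into
// ComparableStringArray.
final class LazyComparableStringList implements Comparable<LazyComparableStringList>
{
  private static final Comparator<String> NULLS_FIRST =
      Comparator.nullsFirst(Comparator.naturalOrder());

  private final List<String> delegate;

  LazyComparableStringList(List<String> delegate)
  {
    this.delegate = delegate;
  }

  List<String> unwrap()
  {
    // downstream consumers that expect a List can use the raw value directly
    return delegate;
  }

  @Override
  public int compareTo(LazyComparableStringList other)
  {
    final int len = Math.min(delegate.size(), other.delegate.size());
    for (int i = 0; i < len; i++) {
      final int cmp = NULLS_FIRST.compare(delegate.get(i), other.delegate.get(i));
      if (cmp != 0) {
        return cmp;
      }
    }
    // element-wise equal prefixes: shorter array sorts first
    return Integer.compare(delegate.size(), other.delegate.size());
  }
}
```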
   
   

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java
##########
@@ -1357,6 +1373,12 @@ private RowBasedKeySerdeHelper makeSerdeHelper(
     )
     {
       switch (valueType.getType()) {
+        case ARRAY:
+          return new ArrayRowBasedKeySerdeHelper(

Review comment:
       this doesn't look like it handles all arrays. Using the `TypeStrategy` added in #11888 might provide the comparators needed for the other array types, but `arrayDictionary` would also need to be able to hold any type of array, not just strings.
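    As a sketch of the generic direction, an element-wise comparator over `Object[]` values could look like this (illustrative only; a real version might obtain the per-element comparator from the element type's `TypeStrategy`):

```java
import java.util.Comparator;

// Sketch: compares arrays element-wise, nulls first, shorter array first on ties.
final class ObjectArrayComparator implements Comparator<Object[]>
{
  private final Comparator<Object> elementComparator;

  ObjectArrayComparator(Comparator<Object> elementComparator)
  {
    this.elementComparator = Comparator.nullsFirst(elementComparator);
  }

  @Override
  public int compare(Object[] lhs, Object[] rhs)
  {
    final int len = Math.min(lhs.length, rhs.length);
    for (int i = 0; i < len; i++) {
      final int cmp = elementComparator.compare(lhs[i], rhs[i]);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(lhs.length, rhs.length);
  }
}
```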

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/expression/builtin/ArrayConstructorOperatorConversion.java
##########
@@ -45,6 +53,22 @@ public DruidExpression toDruidExpression(
       final RexNode rexNode
   )
   {
+    // Check if array needs to be unnested
+    if (plannerContext.getQueryContext()
+                      .getOrDefault(
+                          QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
+                          QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
+                      ).equals(Boolean.FALSE)) {
+      List<RexNode> nodes = ((RexCall) rexNode).getOperands();
+      Preconditions.checkArgument(
+          nodes != null && nodes.size() == 1,
+          "ARRAY[] should have exactly one argument"
+      );
+      if (nodes.get(0).getKind() == SqlKind.LITERAL) {
+        throw new UOE("ARRAY[] support for literals not implemented");
+      }
+      return Expressions.toDruidExpression(plannerContext, rowSignature, nodes.get(0));
+    }

Review comment:
       I don't think this needs a flag or any code change here if we allow arrays in SQL to remain arrays at the native layer (see the other comment on consolidating `Calcites.getColumnTypeForRelDataType` and `Calcites.getValueTypeForRelDataTypeFull`).
   
    Calling the array constructor on a multi-value column today doesn't work correctly anyway: it actually attempts to map the array constructor function across each of the values of the multi-value string, resulting in a nested array of single-element string arrays instead of simply converting the multi-value string into a string array (e.g., for a row where the column value is ['a','b'], ARRAY[col] today yields [['a'],['b']] rather than ['a','b']). So changing that behavior doesn't seem problematic, since I'm not sure the current mapping is very intuitive.
   
    Additionally, I think if this flag existed and was set, it would break all other uses of the array constructor function, which _do_ work today.

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -386,13 +393,16 @@ public static boolean canPushDownLimit(ColumnSelectorFactory columnSelectorFacto
     @Override
     public GroupByColumnSelectorStrategy makeColumnSelectorStrategy(
         ColumnCapabilities capabilities,
-        ColumnValueSelector selector
+        ColumnValueSelector selector,
+        DimensionSpec dimensionSpec
     )
     {
       switch (capabilities.getType()) {
         case STRING:

Review comment:
       so, this PR doesn't _actually_ handle grouping on array-typed columns yet, such as those produced today by virtual columns using array functions, which is sort of a bummer (and part of why, in a different comment, I'm pushing for a more generic implementation of the grouping strategy in the first pass). If you don't plan to do this yet, the PR title and description should be modified to indicate that it allows grouping on multi-value 'STRING' typed columns as arrays, or similar.

##########
File path: processing/src/main/java/org/apache/druid/query/dimension/ColumnSelectorStrategyFactory.java
##########
@@ -24,5 +24,9 @@
 
 public interface ColumnSelectorStrategyFactory<ColumnSelectorStrategyClass extends ColumnSelectorStrategy>
 {
-  ColumnSelectorStrategyClass makeColumnSelectorStrategy(ColumnCapabilities capabilities, ColumnValueSelector selector);
+  ColumnSelectorStrategyClass makeColumnSelectorStrategy(
+      ColumnCapabilities capabilities,
+      ColumnValueSelector selector,
+      DimensionSpec dimensionSpec

Review comment:
       it seems sort of strange to pass a `DimensionSpec` into this method; maybe the caller should modify the `ColumnCapabilities` instead of pushing this in?
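    As a sketch of that alternative (a hypothetical helper, not existing code; it assumes `ColumnCapabilitiesImpl.copyOf` and `DimensionSpec.getOutputType` can be combined this way):

```java
// Hypothetical sketch: the caller computes "effective" capabilities up front,
// so the strategy factory never needs to see the DimensionSpec itself.
static ColumnCapabilities effectiveCapabilities(
    ColumnCapabilities capabilities,
    DimensionSpec dimensionSpec
)
{
  if (dimensionSpec.getOutputType().isArray()) {
    // present the column to the factory as the array type the spec implies
    return ColumnCapabilitiesImpl.copyOf(capabilities).setType(dimensionSpec.getOutputType());
  }
  return capabilities;
}
```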

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java
##########
@@ -403,7 +404,17 @@ private static Grouping computeGrouping(
       }
 
       final RelDataType dataType = rexNode.getType();
-      final ColumnType outputType = Calcites.getColumnTypeForRelDataType(dataType);
+      final ColumnType outputType;
+      if (plannerContext.getQueryContext()
+                        .getOrDefault(
+                            QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
+                            QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
+                        ).equals(Boolean.FALSE)) {
+        outputType = Calcites.getValueTypeForRelDataTypeFull(dataType);
+      } else {
+        outputType = Calcites.getColumnTypeForRelDataType(dataType);
+      }

Review comment:
       if grouping on arrays supported _all_ array types, the correct thing to do instead of a flag, I think, would just be to consolidate `getColumnTypeForRelDataType` and `getValueTypeForRelDataTypeFull`, removing the string coercion and letting arrays stay as arrays. They are separate basically because we allow using array functions on string-typed columns, but grouping on them requires that they remain string-typed in the native layer; if that were no longer true, we wouldn't have to do this coercion any longer.
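    A rough sketch of what the consolidated mapping could look like, letting array types stay arrays instead of coercing them to strings (the case coverage here is illustrative, not the full set of SQL types the real Calcites methods handle):

```java
// Sketch of a consolidated RelDataType -> ColumnType mapping with no string
// coercion for arrays. Unknown types remain null, as today.
@Nullable
static ColumnType toColumnType(RelDataType dataType)
{
  switch (dataType.getSqlTypeName()) {
    case CHAR:
    case VARCHAR:
      return ColumnType.STRING;
    case INTEGER:
    case BIGINT:
      return ColumnType.LONG;
    case FLOAT:
      return ColumnType.FLOAT;
    case DOUBLE:
    case DECIMAL:
      return ColumnType.DOUBLE;
    case ARRAY:
      // recurse on the element type; arrays stay arrays
      final ColumnType elementType = toColumnType(dataType.getComponentType());
      return elementType == null ? null : ColumnType.ofArray(elementType);
    default:
      return null;
  }
}
```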

##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -376,11 +373,37 @@ public static Float convertObjectToFloat(@Nullable Object valObj, boolean report
         return convertObjectToDouble(obj, reportParseExceptions);
       case STRING:
         return convertObjectToString(obj);
+      case ARRAY:
+        return convertToArray(obj);

Review comment:
       Does this change mean that anything dealing with arrays that goes through here would need to know how to handle `ComparableStringArray` instead of the actual arrays? Additionally, this method only seems to handle string arrays; it should either check that the element type is string and fail otherwise, or produce a more generic type that can handle comparison of any type of array.
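    A sketch of the fail-fast variant (hypothetical shape, not the PR's code; it reuses `ComparableStringArray` and `ISE` from the imports above):

```java
// Sketch: convert to a string array, but fail loudly on non-string elements
// instead of silently assuming strings.
@Nullable
static ComparableStringArray convertToStringArray(@Nullable Object obj)
{
  if (obj == null) {
    return null;
  }
  if (obj instanceof ComparableStringArray) {
    return (ComparableStringArray) obj;
  }
  if (obj instanceof Object[]) {
    final Object[] array = (Object[]) obj;
    final String[] strings = new String[array.length];
    for (int i = 0; i < array.length; i++) {
      if (array[i] != null && !(array[i] instanceof String)) {
        throw new ISE("Cannot convert element of type [%s] to a string array", array[i].getClass());
      }
      strings[i] = (String) array[i];
    }
    return ComparableStringArray.of(strings);
  }
  throw new ISE("Cannot convert [%s] to a string array", obj.getClass());
}
```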

##########
File path: processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChestTest.java
##########
@@ -573,7 +572,7 @@ public void testResultSerde() throws Exception
         .setQuerySegmentSpec(QueryRunnerTestHelper.FIRST_TO_THIRD)
         .setDimensions(Collections.singletonList(DefaultDimensionSpec.of("test")))
         .setAggregatorSpecs(Collections.singletonList(QueryRunnerTestHelper.ROWS_COUNT))
-        .setPostAggregatorSpecs(Collections.singletonList(new ConstantPostAggregator("post", 10)))
+        .setPostAggregatorSpecs(Collections.singletonList(new ConstantPostAggregator("post", 10.0)))

Review comment:
       I assume these changes are from the serializer change?

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy
+    implements GroupByColumnSelectorStrategy
+{
+  private static final int GROUP_BY_MISSING_VALUE = -1;
+
+
+  // contains string <-> id for each element of the multi-value grouping column
+  // e.g. if [a,b,c] is the column value, dictionaryToInt will contain { a <-> 1, b <-> 2, c <-> 3 }
+  private final BiMap<String, Integer> dictionaryToInt;

Review comment:
       I wonder if maybe this is a premature optimization? I think we should consider making a more generic implementation first, one that just stores a lookup of int to full arrays instead of arrays of ints, and preferably one that can handle _all_ types of arrays. With that in hand, it would be a lot easier to measure various other optimizations against that baseline.
   
    Not to say that this isn't the best approach to handling arrays of strings; it might very well be. But I think there is also potentially a lot of overhead in this approach, since you'll need to be doing extra function calls, and it saves the most memory when value cardinality is low, and to the largest extent only for arrays of variably-sized types like strings. If all string values are unique, then it is potentially worse than just storing the string arrays directly.
   
    We also don't actually seem to be taking advantage of what I would think is the real situation to use this approach: when the underlying multi-value string column is already dictionary encoded, we could use its dictionary ids directly instead of making a new lookup here. In that special case, where grouping on a multi-value string column from a segment as an array, I can maybe see it being advantageous to just convert the `IndexedInts` to an array of ints directly, so you can defer ever looking up the string value for a dictionary id until it is needed. But that _only_ works when the underlying column is dictionary encoded, which isn't always the case.
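    A minimal sketch of that generic baseline (names are purely illustrative, not existing Druid code): a single dictionary straight from int id to the full array value, held as a `List` so equals/hashCode work, with no per-element encoding:

```java
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.util.Arrays;
import java.util.List;

// Sketch: one id per full array value, for any element type.
final class GenericArrayDictionary
{
  private final BiMap<List<Object>, Integer> arrayToId = HashBiMap.create();

  int idFor(Object[] array)
  {
    // Arrays.asList gives value-based equality, unlike raw Object[] keys
    return arrayToId.computeIfAbsent(Arrays.asList(array), a -> arrayToId.size());
  }

  List<Object> arrayFor(int id)
  {
    return arrayToId.inverse().get(id);
  }
}
```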




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


