Posted to commits@druid.apache.org by "imply-cheddar (via GitHub)" <gi...@apache.org> on 2023/03/27 04:16:21 UTC

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization

imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148747811


##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
 public interface ColumnConfig
 {
   int columnCacheSizeBytes();
+
+  /**
+   * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+   * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+   * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+   * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+   * This is indicated by returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+   * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the

Review Comment:
   I'm scared of exposing this behavior through adjusting the return of `.as()`.  I think it would be better to have the columns continue to return a meaningful object for `.as()`, but have those interfaces that benefit from this be able to opt themselves out of being used for bitmaps.  (Or, in some future world, return an equivalent `ValueMatcher`.)
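   Something like this rough sketch of the opt-out shape (the interface and method names below are made up for illustration; they are not existing Druid APIs):

   ```java
   // Illustrative only: an index object that .as() still returns, but that can decline to be
   // used for bitmap construction when the configured threshold says a full scan is cheaper.
   public interface CostAwareBitmapColumnIndex
   {
     /**
      * Returns false when computing bitmaps is expected to cost more than a full scan,
      * letting the caller fall back to a ValueMatcher (or, eventually, ask this object
      * for an equivalent ValueMatcher) instead of receiving a null from as().
      */
     boolean shouldUseIndex(int totalRows);
   }
   ```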



##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -530,6 +533,10 @@ private ColumnHolder readNestedFieldColumn(String field)
       columnBuilder.setHasMultipleValues(false)
                    .setHasNulls(hasNull)
                    .setDictionaryEncodedColumnSupplier(columnSupplier);
+
+      ColumnarInts intsColumn = ints.get();
+      final int size = intsColumn.size();
+      intsColumn.close();

Review Comment:
   This type of pattern is a bit scary.  In this instance, given that it happens while deserializing the object, it's probably fine, but we've had issues where we thought we were getting a thing and then closing it, only to find out that the thing was cached and supposed to live for the lifetime of a query, so the immediate close was closing something still in use.
   
   That said, I think we can make this cleaner by making the `Supplier<>` itself a class with a method that reports the number of rows represented by the thing it supplies.  It should be easy enough to implement this for the suppliers used in this case (the number is already on the jvm heap for those objects anyway), and it avoids any need to create an object only to close it immediately.
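   Roughly along these lines (a sketch only; `ColumnPartSupplier` and `numRows()` are illustrative names, not existing Druid types):

   ```java
   import com.google.common.base.Supplier;

   // Illustrative only: a supplier that also reports the row count of the column part it
   // creates, so callers can read the size without get()-ing a column and closing it again.
   public abstract class ColumnPartSupplier<T> implements Supplier<T>
   {
     private final int numRows;

     protected ColumnPartSupplier(int numRows)
     {
       this.numRows = numRows;
     }

     /** Number of rows in the supplied column part; already known on-heap when the supplier is built. */
     public int numRows()
     {
       return numRows;
     }
   }
   ```

   With something like that in place, the `ints.get()` / `size()` / `close()` sequence above becomes a single `numRows()` call on the supplier.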



##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
 public interface ColumnConfig
 {
   int columnCacheSizeBytes();
+
+  /**
+   * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+   * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+   * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+   * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+   * This is indicated by returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+   * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
+   * number of bitmap operations is determined by how many individual values fall in the range, a subset of the column's
+   * total cardinality.
+   * <p>
+   * Currently only the {@link org.apache.druid.segment.nested.NestedFieldLiteralColumnIndexSupplier} implements this
+   * behavior.
+   * <p>
+   * This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
+   * {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
+   * often exceed the cost of just doing a full scan and using a
+   * {@link org.apache.druid.query.filter.ValueMatcher}.

Review Comment:
   I think the "many" in "This can make many standalone filters faster" is a bit of a hyperbola.  What about "Some standalone filters speed up with this optimization in cases where the overhead of walking the dictionary and combining bitmaps to construct ..."?



##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -538,7 +545,10 @@ private ColumnHolder readNestedFieldColumn(String field)
               localDictionarySupplier,
               stringDictionarySupplier,
               longDictionarySupplier,
-              doubleDictionarySupplier
+              doubleDictionarySupplier,
+              size,
+              columnConfig.skipValueRangeIndexScale(),
+              columnConfig.skipValuePredicateIndexScale()

Review Comment:
   Why not just pass the `columnConfig` object itself?



##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
       "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
       // 28, 29
       "SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
-      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 30, 31
+      "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 32,33
+      "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 34,35 smaller cardinality like range filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+      // 36,37 smaller cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+      // 38-39 moderate cardinality like range
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+      // 40, 41 big cardinality lex range
+      "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+      // 42, 43 big cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",

Review Comment:
   Can you add various sized IN clauses to your benchmark?  10, 100, 1000, 10000, 100000?
   
   After reading the PR, I don't think they should be impacted, but it would be good to confirm that they aren't, since IN clauses are an example of a thing that falls back to predicate-like behavior, and we have also seen them cause performance issues when run as a `ValueMatcher`.
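   For example, the IN queries could be generated with a small helper along these lines (an illustrative sketch; the column names follow the existing queries, the literal values are arbitrary):

   ```java
   // Illustrative sketch: builds IN filters of increasing size to append to the benchmark
   // query list, e.g. query(10), query(100), query(1000), query(10000), query(100000).
   final class InClauseQueries
   {
     static String query(int numValues)
     {
       final StringBuilder values = new StringBuilder();
       for (int i = 0; i < numValues; i++) {
         if (i > 0) {
           values.append(", ");
         }
         values.append("'").append(i).append("'");
       }
       return "SELECT SUM(long1) FROM foo WHERE string5 IN (" + values + ")";
     }
   }
   ```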



##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldLiteralColumnIndexSupplier.java:
##########
@@ -124,6 +131,10 @@ public <T> T as(Class<T> clazz)
       return (T) new NestedLiteralDictionaryEncodedStringValueIndex();
     }
 
+    if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
+      return null;
+    }

Review Comment:
   Technically, this would indicate that this column cannot implement the given interface, so the filter in question will fall back to another interface (i.e. it might make the filter cascade down into trying an even worse strategy).  Instead, we should return a `DruidPredicateIndex` that returns `null` when asked to compute the index, indicating that the index could not be computed.  This will make it easier to do this "correctly" when we make the change to have this object also be able to return `ValueMatcher` objects.
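   i.e., roughly (a sketch; this assumes `DruidPredicateIndex` is a single-method interface whose index computation is allowed to return null):

   ```java
   // Sketch only: keep advertising DruidPredicateIndex through as(), but have the returned
   // index report "unable to compute" instead, so filters don't cascade to a worse strategy.
   if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
     final DruidPredicateIndex alwaysUnknown = predicateFactory -> null;
     return (T) alwaysUnknown;
   }
   ```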



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

