Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/03/25 06:04:43 UTC

[GitHub] [druid] clintropolis opened a new pull request, #13977: smarter nested column index utilization

clintropolis opened a new pull request, #13977:
URL: https://github.com/apache/druid/pull/13977

   ### Description
   A follow-up of sorts to #12830, which added `NumericRangeIndex` to improve the performance of nested column bound filters on numeric columns, but noted that there were still cases where the cost of the bitmap operations exceeded the cost of just doing a full scan and using value matchers. This is also more or less the same issue described in #3878.
   
   This PR attempts to improve the situation by adding cardinality-based thresholds that allow range and predicate index processing to skip bitmap indexes entirely when they would be too expensive.
   
   changes:
   * adds `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` to `ColumnConfig` (e.g. `DruidProcessingConfig`), available as system config via `druid.processing.indexes.skipValueRangeIndexScale` and `druid.processing.indexes.skipValuePredicateIndexScale`
   * `NestedColumnIndexSupplier` multiplies `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` by the total number of rows to be processed to determine the threshold beyond which we should no longer consider using bitmap indexes because it would require too many operations (see the sketch after this list)
   * default values for `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` are initially set to 0.08, but they are separate settings to allow independent tuning
   * these are intentionally undocumented for now because they are hard to explain; they mainly exist to help conduct larger scale experiments than the JMH benchmarks used to derive the initial values
   * these changes provide a substantial performance boost for filter processing on nested columns
   
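   To make the heuristic concrete, here is a minimal sketch of the threshold math, assuming a hypothetical holder class (`IndexSkipHeuristic` is illustrative, not actual Druid code; only the scale names and the `ceil(scale * numRows)` comparison come from this PR):

   ```java
   // Illustrative sketch only: IndexSkipHeuristic is a hypothetical class; the
   // scale names and the ceil(scale * numRows) comparison mirror this PR.
   public final class IndexSkipHeuristic
   {
     private IndexSkipHeuristic() {}

     // Skip the range index when the number of bitmap union operations implied
     // by the range exceeds skipValueRangeIndexScale * numRows.
     public static boolean skipRangeIndex(double skipValueRangeIndexScale, int numRows, int rangeSize)
     {
       return rangeSize > (int) Math.ceil(skipValueRangeIndexScale * numRows);
     }

     // Skip the predicate index when the dictionary cardinality, an upper bound
     // on the number of predicate evaluations and bitmap operations, exceeds
     // skipValuePredicateIndexScale * numRows.
     public static boolean skipPredicateIndex(double skipValuePredicateIndexScale, int numRows, int dictionaryCardinality)
     {
       return dictionaryCardinality > (int) Math.ceil(skipValuePredicateIndexScale * numRows);
     }
   }
   ```
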
   We might want to consider making these same changes to `DictionaryEncodedStringIndexSupplier`, but I have not done so at this time.
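
   Though intentionally undocumented, a hypothetical `runtime.properties` override of both settings (shown with the 0.08 defaults from this PR) would look like:

   ```
   # hypothetical overrides; 0.08 is the initial default from this PR
   druid.processing.indexes.skipValueRangeIndexScale=0.08
   druid.processing.indexes.skipValuePredicateIndexScale=0.08
   ```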
   
   #### Benchmarks comparing to regular columns
   
   ```
   14, 15
   SELECT SUM(long1) FROM foo WHERE double3 < 10000.0 AND double3 > 1000.0
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 10000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       14           5000000              none        false  avgt    5  102.947 ±   0.875  ms/op
   SqlNestedDataBenchmark.querySql       14           5000000              none        force  avgt    5   63.557 ±   0.477  ms/op
   SqlNestedDataBenchmark.querySql       15           5000000              none        false  avgt    5  122.576 ±   0.919  ms/op
   SqlNestedDataBenchmark.querySql       15           5000000              none        force  avgt    5   66.158 ±   0.667  ms/op
   ```
   
   ```
   26, 27
   SELECT SUM(long1) FROM foo WHERE double3 < 1005.0 AND double3 > 1000.0
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       26           5000000              none        false  avgt    5   89.212 ±   4.634  ms/op
   SqlNestedDataBenchmark.querySql       26           5000000              none        force  avgt    5   60.594 ±   3.423  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000              none        false  avgt    5   19.526 ±   0.543  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000              none        force  avgt    5   23.740 ±   0.404  ms/op
   ```
   
   ```
   28, 29
   SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       28           5000000              none        false  avgt    5  113.113 ±  14.400  ms/op
   SqlNestedDataBenchmark.querySql       28           5000000              none        force  avgt    5   62.150 ±   0.211  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000              none        false  avgt    5   60.687 ±   0.549  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000              none        force  avgt    5   50.789 ±   0.935  ms/op
   ```
   
   ```
   30, 31
   SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       30           5000000              none        false  avgt    5  104.038 ±   1.316  ms/op
   SqlNestedDataBenchmark.querySql       30           5000000              none        force  avgt    5   62.930 ±   0.327  ms/op
   SqlNestedDataBenchmark.querySql       31           5000000              none        false  avgt    5  126.925 ±   1.966  ms/op
   SqlNestedDataBenchmark.querySql       31           5000000              none        force  avgt    5   67.254 ±   2.807  ms/op
   ```
   
   ```
   32, 33
   SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       32           5000000              none        false  avgt    5  111.034 ±   1.047  ms/op
   SqlNestedDataBenchmark.querySql       32           5000000              none        force  avgt    5   73.748 ±   0.749  ms/op
   SqlNestedDataBenchmark.querySql       33           5000000              none        false  avgt    5  116.868 ±   2.659  ms/op
   SqlNestedDataBenchmark.querySql       33           5000000              none        force  avgt    5   75.269 ±   0.208  ms/op
   ```
   
   ```
   34,35 smaller cardinality like range filter
   SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       34           5000000              none        false  avgt    5   34.077 ±   0.365  ms/op
   SqlNestedDataBenchmark.querySql       34           5000000              none        force  avgt    5   25.754 ±   0.384  ms/op
   SqlNestedDataBenchmark.querySql       35           5000000              none        false  avgt    5   59.226 ±   0.300  ms/op
   SqlNestedDataBenchmark.querySql       35           5000000              none        force  avgt    5   29.900 ±   0.301  ms/op
   ```
   
   ```
   36,37 smaller cardinality like predicate filter
   SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       36           5000000              none        false  avgt    5   81.823 ±   0.609  ms/op
   SqlNestedDataBenchmark.querySql       36           5000000              none        force  avgt    5   40.584 ±   0.703  ms/op
   SqlNestedDataBenchmark.querySql       37           5000000              none        false  avgt    5   80.930 ±   4.364  ms/op
   SqlNestedDataBenchmark.querySql       37           5000000              none        force  avgt    5   37.533 ±   0.464  ms/op
   ```
   
   ```
   38, 39 moderate cardinality like range
   SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       38           5000000              none        false  avgt    5   77.782 ±   0.898  ms/op
   SqlNestedDataBenchmark.querySql       38           5000000              none        force  avgt    5   44.718 ±   0.882  ms/op
   SqlNestedDataBenchmark.querySql       39           5000000              none        false  avgt    5   68.806 ±   0.952  ms/op
   SqlNestedDataBenchmark.querySql       39           5000000              none        force  avgt    5   50.848 ±   0.975  ms/op
   ```
   
   ```
   40, 41 big cardinality lex range
   SELECT SUM(long1) FROM foo WHERE string5 > '1'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       40           5000000              none        false  avgt    5  245.087 ±   1.706  ms/op
   SqlNestedDataBenchmark.querySql       40           5000000              none        force  avgt    5  222.619 ±   3.206  ms/op
   SqlNestedDataBenchmark.querySql       41           5000000              none        false  avgt    5  354.551 ±   8.115  ms/op
   SqlNestedDataBenchmark.querySql       41           5000000              none        force  avgt    5  211.717 ±   9.943  ms/op
   ```
   
   ```
   42, 43 big cardinality like predicate filter
   SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       42           5000000              none        false  avgt    5  796.396 ± 119.324  ms/op
   SqlNestedDataBenchmark.querySql       42           5000000              none        force  avgt    5  648.279 ±   4.008  ms/op
   SqlNestedDataBenchmark.querySql       43           5000000              none        false  avgt    5  504.493 ±  92.658  ms/op
   SqlNestedDataBenchmark.querySql       43           5000000              none        force  avgt    5  383.268 ±  14.283  ms/op
   ```
   
   ```
   44, 45 big cardinality like filter + selector filter
   SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%' AND string1 = '1000'
   SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%' AND JSON_VALUE(nested, '$.nesteder.string1') = '1000'
   
   Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
   SqlNestedDataBenchmark.querySql       44           5000000              none        false  avgt    5  641.740 ±  13.625  ms/op
   SqlNestedDataBenchmark.querySql       44           5000000              none        force  avgt    5  598.289 ±   9.617  ms/op
   SqlNestedDataBenchmark.querySql       45           5000000              none        false  avgt    5   21.523 ±   0.254  ms/op
   SqlNestedDataBenchmark.querySql       45           5000000              none        force  avgt    5   19.543 ±   0.231  ms/op
   ```
   
   
   ##### Key changed/added classes in this PR
    * `ColumnConfig`
    * `NestedFieldLiteralColumnIndexSupplier`
   




[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization

clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483747734

   The different test results seem to be caused by differing behavior between the predicate index used by the bound filter and the value matcher; the latter looks like it is matching nulls. The results now actually look more consistent with other numeric filters using predicates, due to lexicographic sorting.
   
   I think there is a bug with the bound filter: it needs to wrap the predicate index in the null-matching lower bound in the same way that the range index does.




[GitHub] [druid] github-code-scanning[bot] commented on a diff in pull request #13977: smarter nested column index utilization

github-code-scanning[bot] commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1158051604


##########
processing/src/main/java/org/apache/druid/segment/AutoTypeColumnMerger.java:
##########
@@ -87,15 +86,14 @@
       String name,
       IndexSpec indexSpec,
       SegmentWriteOutMedium segmentWriteOutMedium,
-      ProgressIndicator progressIndicator,
+      @SuppressWarnings("unused") ProgressIndicator progressIndicator,

Review Comment:
   ## Useless parameter
   
   The parameter 'progressIndicator' is never used.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4727)



##########
processing/src/main/java/org/apache/druid/segment/nested/VariantArrayColumnAndIndexSupplier.java:
##########
@@ -199,7 +207,9 @@
       Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
       Supplier<ColumnarInts> encodedValueColumnSupplier,
       GenericIndexed<ImmutableBitmap> valueIndexes,
-      @SuppressWarnings("unused") BitmapFactory bitmapFactory
+      @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+      @SuppressWarnings("unused") ColumnConfig columnConfig,
+      @SuppressWarnings("unused") int numRows

Review Comment:
   ## Useless parameter
   
   The parameter 'numRows' is never used.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4729)



##########
processing/src/main/java/org/apache/druid/segment/nested/VariantArrayColumnAndIndexSupplier.java:
##########
@@ -199,7 +207,9 @@
       Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
       Supplier<ColumnarInts> encodedValueColumnSupplier,
       GenericIndexed<ImmutableBitmap> valueIndexes,
-      @SuppressWarnings("unused") BitmapFactory bitmapFactory
+      @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+      @SuppressWarnings("unused") ColumnConfig columnConfig,

Review Comment:
   ## Useless parameter
   
   The parameter 'columnConfig' is never used.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4728)





[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization

clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483754750

   >I think there is a bug with the bound filter, it needs to wrap the predicate index in the null matching lower bound in the same way that the range index does.
   
   Actually, the mismatch is between the double predicate used for the `DruidPredicateIndex` of the nested column and the string predicate used by the value matcher. I'm not entirely certain how to fix this yet, since the virtual column is planned with an output type of 'string', while internally the nested field column is typed 'double' and so uses the double predicate. The predicate index might need to further specify which predicate should be used, instead of just supplying the factory, so that I can make this better behaved.




[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization

imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148747811


##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
 public interface ColumnConfig
 {
   int columnCacheSizeBytes();
+
+  /**
+   * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+   * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+   * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+   * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+   * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+   * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the

Review Comment:
   I'm scared of exposing this behavior through adjusting the return of `.as()`.  I think it would be better to have the columns continue to return a meaningful object for `.as()`, but have those interfaces that benefit from this be able to opt themselves out of being used for bitmaps.  (Or, in some future world, return an equivalent `ValueMatcher`.)



##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -530,6 +533,10 @@ private ColumnHolder readNestedFieldColumn(String field)
       columnBuilder.setHasMultipleValues(false)
                    .setHasNulls(hasNull)
                    .setDictionaryEncodedColumnSupplier(columnSupplier);
+
+      ColumnarInts intsColumn = ints.get();
+      final int size = intsColumn.size();
+      intsColumn.close();

Review Comment:
   This type of pattern is a bit scary.  In this instance, given that it happens when deserializing the object, it's probably fine, but we've had issues where we thought we were getting a thing and then closing it, only to find out that the thing was cached and supposed to live for the lifetime of a query, and the immediate close was closing it out from under the query.
   
   That said, I think we can make this cleaner by making the `Supplier<>` itself be a class with a method that knows the number of rows represented by the thing it supplies.  It should be easy enough to implement this for the suppliers used in this case (the number is already on the JVM heap for those objects anyway), and it avoids any need to create an object and close it immediately.
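
   A minimal sketch of that suggestion, assuming a hypothetical `SizedSupplier` name (the real interface and method names would differ):

   ```java
   // Hypothetical sketch: a Supplier that also reports the number of rows of
   // the thing it supplies, so callers never need to create the supplied
   // object and immediately close it just to learn its size.
   import java.util.function.Supplier;

   public interface SizedSupplier<T> extends Supplier<T>
   {
     // Number of rows represented by the supplied object, known without get().
     int size();
   }
   ```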



##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
 public interface ColumnConfig
 {
   int columnCacheSizeBytes();
+
+  /**
+   * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+   * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+   * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+   * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+   * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+   * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
+   * number of bitmap operations is determined by how many individual values fall in the range, a subset of the columns
+   * total cardinality.
+   * <p>
+   * Currently only the {@link org.apache.druid.segment.nested.NestedFieldLiteralColumnIndexSupplier} implements this
+   * behavior.
+   * <p>
+   * This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
+   * {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
+   * often exceed the cost of just using doing a full scan and using a
+   * {@link org.apache.druid.query.filter.ValueMatcher}.

Review Comment:
   I think the "many" in "This can make many standalone filters faster" is a bit of hyperbole.  What about "Some standalone filters speed up with this optimization in cases where the overhead of walking the dictionary and combining bitmaps to construct ..."?



##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -538,7 +545,10 @@ private ColumnHolder readNestedFieldColumn(String field)
               localDictionarySupplier,
               stringDictionarySupplier,
               longDictionarySupplier,
-              doubleDictionarySupplier
+              doubleDictionarySupplier,
+              size,
+              columnConfig.skipValueRangeIndexScale(),
+              columnConfig.skipValuePredicateIndexScale()

Review Comment:
   Why not just pass the `columnConfig` object itself?



##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
       "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
       // 28, 29
       "SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
-      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 30, 31
+      "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 32,33
+      "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 34,35 smaller cardinality like range filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+      // 36,37 smaller cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+      // 38-39 moderate cardinality like range
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+      // 40, 41 big cardinality lex range
+      "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+      // 42, 43 big cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",

Review Comment:
   Can you add various sized IN clauses to your benchmark?  10, 100, 1000, 10000, 100000?
   
   After reading the PR, I don't think they should be impacted, but it would be good to know that they aren't, as they are an example of something that falls back to predicate-like behavior but which we have also seen cause performance issues when run as a `ValueMatcher`.



##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldLiteralColumnIndexSupplier.java:
##########
@@ -124,6 +131,10 @@ public <T> T as(Class<T> clazz)
       return (T) new NestedLiteralDictionaryEncodedStringValueIndex();
     }
 
+    if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
+      return null;
+    }

Review Comment:
   Technically, this would indicate that this column cannot implement the given interface, so the filter in question would fall back to another interface (i.e. it might make the filter cascade down into trying an even worse strategy).  Instead, we should return a `DruidPredicateIndex` that returns `null` when asked to compute the index, indicating that it was unable to do so.  This will make it easier to do this "correctly" when we change this object to also be able to return `ValueMatcher` objects.
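
   A hedged sketch of that alternative, based on the `DruidPredicateIndex` interface quoted later in this thread (`computeActualIndex` is a hypothetical stand-in for the existing index construction):

   ```java
   // Sketch: keep as() returning a real DruidPredicateIndex, but signal
   // "index unavailable" by returning null from forPredicate() when the skip
   // threshold is hit, instead of returning null from as() itself.
   if (clazz.equals(DruidPredicateIndex.class)) {
     return (T) new DruidPredicateIndex()
     {
       @Nullable
       @Override
       public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
       {
         return skipPredicateIndex ? null : computeActualIndex(matcherFactory);
       }
     };
   }
   ```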





[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization

imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1158089484


##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldColumnIndexSupplier.java:
##########
@@ -119,8 +121,8 @@ public NestedFieldColumnIndexSupplier(
     this.arrayElementBitmaps = arrayElementBitmaps;
     this.adjustLongId = globalStringDictionarySupplier.get().size();
     this.adjustDoubleId = adjustLongId + globalLongDictionarySupplier.get().size();
-    this.skipRangeIndexThreshold = (int) Math.ceil(skipValueRangeIndexScale * numRows);
-    this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(skipValuePredicateIndexScale * numRows);
+    this.skipRangeIndexThreshold = (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
+    this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);

Review Comment:
   I think it's a bit nicer to make these methods instead of fields computed in every constructor call.  The computation being done is minimal, it's true, but it's better to only do it when needed.  If you just store the reference to the `ColumnConfig`, you can pretty easily move this check into private methods.  All of the places where you would check for skipping the predicate index would probably also be grabbing a `localDictionarySupplier` anyway, so you can reuse that singular reference too.
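
   A quick sketch of that shape, storing the `ColumnConfig` reference and deferring the math to private helpers (names mirror the quoted diff; this is illustrative, not the actual change):

   ```java
   // Compute the thresholds lazily in private methods rather than eagerly in
   // the constructor; columnConfig, numRows, and localDictionarySupplier are
   // fields kept from the constructor arguments.
   private boolean skipRangeIndex(int rangeSize)
   {
     return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
   }

   private boolean skipPredicateIndex()
   {
     return localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
   }
   ```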





[GitHub] [druid] clintropolis commented on a diff in pull request #13977: smarter nested column index utilization

clintropolis commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148822590


##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
       "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
       // 28, 29
       "SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
-      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 30, 31
+      "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 32,33
+      "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+      // 34,35 smaller cardinality like range filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+      // 36,37 smaller cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+      // 38-39 moderate cardinality like range
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+      // 40, 41 big cardinality lex range
+      "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+      // 42, 43 big cardinality like predicate filter
+      "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+      "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",

Review Comment:
   `InDimFilter` typically only uses `Utf8ValueSetIndex` and `StringValueSetIndex`, which cannot be impacted by the changes currently in this PR, _unless_ it is using an `ExtractionFn`, which causes it to use a `DruidPredicateIndex`. I need to figure out how to get a query in here that plans to something with an extractionFn, to see how this PR would impact that filter.





[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization

clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483742354

   I decided to base the `DruidPredicateIndex` threshold on the overall value cardinality instead of evaluating each predicate, on the assumption that the predicates themselves are not that cheap. However, there might be room to explore evaluating the predicates below a certain dictionary size to determine the actual number of bitmap operations, like we are doing for range indexes (the `LexicographicalRangeIndex` for strings does have a predicate form that further reduces the actual number of bitmaps, but it operates on the full range before applying the predicate).




[GitHub] [druid] clintropolis commented on a diff in pull request #13977: smarter nested column index utilization

clintropolis commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148825560


##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
 public interface ColumnConfig
 {
   int columnCacheSizeBytes();
+
+  /**
+   * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+   * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+   * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+   * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+   * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+   * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the

Review Comment:
   This was already true of `as` itself: returning `null` has always been an indicator that the index we asked for is not available. I just made that a bit clearer in the docs, and also leaned into allowing the things `as` returns to themselves return null to indicate that an index is not available.
   
   For the types of short circuits done in this PR, I feel like the index supplier has better information to make the decision than anything above it, since it has direct access to the size of the dictionary, the number of rows to be scanned, etc., but I will think on it.





[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization

imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1159213704


##########
processing/src/main/java/org/apache/druid/segment/column/DruidPredicateIndex.java:
##########
@@ -36,4 +36,12 @@
    */
   @Nullable
   BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory);
+
+  static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int dictionaryCardinality)

Review Comment:
   `shouldSkip`?  `checkSkipThreshold` doesn't really tell me how I'm supposed to interpret `true/false`.  Does `true` mean that it exceeded the threshold and therefore should be skipped?  Or does it mean that it did not exceed the threshold and should be used?



##########
processing/src/main/java/org/apache/druid/segment/column/NumericRangeIndex.java:
##########
@@ -43,4 +43,9 @@ BitmapColumnIndex forRange(
       @Nullable Number endValue,
       boolean endStrict
   );
+
+  static boolean checkSkipThreshold(ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
+  {
+    return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);

Review Comment:
   You aren't doing the null check on this one.  That intentional?
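
   For comparison, a null-safe variant mirroring the `DruidPredicateIndex` helper quoted above might look like this (treating a missing config as "never skip" is an assumption):

   ```java
   // Hypothetical null-checked variant of the quoted static helper; returning
   // false when columnConfig is null (never skip without a config) is assumed.
   static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
   {
     if (columnConfig == null) {
       return false;
     }
     return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
   }
   ```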





[GitHub] [druid] clintropolis merged pull request #13977: smarter nested column index utilization

clintropolis merged PR #13977:
URL: https://github.com/apache/druid/pull/13977






[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization

clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483743051

   Also, it looks like I might have some legitimate test failures, possibly due to a mismatch in behavior between indexes and value matchers on some nested columns 😓 investigating

