You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/03/25 06:04:43 UTC
[GitHub] [druid] clintropolis opened a new pull request, #13977: smarter nested column index utilization
clintropolis opened a new pull request, #13977:
URL: https://github.com/apache/druid/pull/13977
### Description
A sort of follow-up to #12830, which added `NumericRangeIndex` to improve the performance of nested column bound filters on numeric columns, but noted that there were still cases where the cost of bitmap operations overran the cost of just doing a full scan and using value matchers. This is also more or less the same thing described in #3878.
This PR attempts to improve this situation by adding cardinality based thresholds for range and predicate indexes to choose to skip using bitmap indexes.
changes:
* adds `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` to `ColumnConfig` (e.g. `DruidProcessingConfig`) available as system config via `druid.processing.indexes.skipValueRangeIndexScale` and `druid.processing.indexes.skipValuePredicateIndexScale`
* `NestedColumnIndexSupplier` uses `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` to multiply by the total number of rows to be processed to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations
* Default values for `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` have been initially set to 0.08, but are separate to allow independent tuning
* these are not documented on purpose yet because they are kind of hard to explain, the mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values
* these changes provide a pretty sweet performance boost for filter processing on nested columns
We might want to consider making these same changes to `DictionaryEncodedStringIndexSupplier`, but I have not done so at this time.
#### Benchmarks comparing to regular columns
```
SELECT SUM(long1) FROM foo WHERE double3 < 10000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 10000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 14 5000000 none false avgt 5 102.947 ± 0.875 ms/op
SqlNestedDataBenchmark.querySql 14 5000000 none force avgt 5 63.557 ± 0.477 ms/op
SqlNestedDataBenchmark.querySql 15 5000000 none false avgt 5 122.576 ± 0.919 ms/op
SqlNestedDataBenchmark.querySql 15 5000000 none force avgt 5 66.158 ± 0.667 ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 1005.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 26 5000000 none false avgt 5 89.212 ± 4.634 ms/op
SqlNestedDataBenchmark.querySql 26 5000000 none force avgt 5 60.594 ± 3.423 ms/op
SqlNestedDataBenchmark.querySql 27 5000000 none false avgt 5 19.526 ± 0.543 ms/op
SqlNestedDataBenchmark.querySql 27 5000000 none force avgt 5 23.740 ± 0.404 ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 28 5000000 none false avgt 5 113.113 ± 14.400 ms/op
SqlNestedDataBenchmark.querySql 28 5000000 none force avgt 5 62.150 ± 0.211 ms/op
SqlNestedDataBenchmark.querySql 29 5000000 none false avgt 5 60.687 ± 0.549 ms/op
SqlNestedDataBenchmark.querySql 29 5000000 none force avgt 5 50.789 ± 0.935 ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 30 5000000 none false avgt 5 104.038 ± 1.316 ms/op
SqlNestedDataBenchmark.querySql 30 5000000 none force avgt 5 62.930 ± 0.327 ms/op
SqlNestedDataBenchmark.querySql 31 5000000 none false avgt 5 126.925 ± 1.966 ms/op
SqlNestedDataBenchmark.querySql 31 5000000 none force avgt 5 67.254 ± 2.807 ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 32 5000000 none false avgt 5 111.034 ± 1.047 ms/op
SqlNestedDataBenchmark.querySql 32 5000000 none force avgt 5 73.748 ± 0.749 ms/op
SqlNestedDataBenchmark.querySql 33 5000000 none false avgt 5 116.868 ± 2.659 ms/op
SqlNestedDataBenchmark.querySql 33 5000000 none force avgt 5 75.269 ± 0.208 ms/op
```
```
34,35 smaller cardinality like range filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 34 5000000 none false avgt 5 34.077 ± 0.365 ms/op
SqlNestedDataBenchmark.querySql 34 5000000 none force avgt 5 25.754 ± 0.384 ms/op
SqlNestedDataBenchmark.querySql 35 5000000 none false avgt 5 59.226 ± 0.300 ms/op
SqlNestedDataBenchmark.querySql 35 5000000 none force avgt 5 29.900 ± 0.301 ms/op
```
```
36,37 smaller cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 36 5000000 none false avgt 5 81.823 ± 0.609 ms/op
SqlNestedDataBenchmark.querySql 36 5000000 none force avgt 5 40.584 ± 0.703 ms/op
SqlNestedDataBenchmark.querySql 37 5000000 none false avgt 5 80.930 ± 4.364 ms/op
SqlNestedDataBenchmark.querySql 37 5000000 none force avgt 5 37.533 ± 0.464 ms/op
```
```
38, 39 moderate cardinality like range
SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 38 5000000 none false avgt 5 77.782 ± 0.898 ms/op
SqlNestedDataBenchmark.querySql 38 5000000 none force avgt 5 44.718 ± 0.882 ms/op
SqlNestedDataBenchmark.querySql 39 5000000 none false avgt 5 68.806 ± 0.952 ms/op
SqlNestedDataBenchmark.querySql 39 5000000 none force avgt 5 50.848 ± 0.975 ms/op
```
```
40, 41 big cardinality lex range
SELECT SUM(long1) FROM foo WHERE string5 > '1'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 40 5000000 none false avgt 5 245.087 ± 1.706 ms/op
SqlNestedDataBenchmark.querySql 40 5000000 none force avgt 5 222.619 ± 3.206 ms/op
SqlNestedDataBenchmark.querySql 41 5000000 none false avgt 5 354.551 ± 8.115 ms/op
SqlNestedDataBenchmark.querySql 41 5000000 none force avgt 5 211.717 ± 9.943 ms/op
```
```
42, 43 big cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 42 5000000 none false avgt 5 796.396 ± 119.324 ms/op
SqlNestedDataBenchmark.querySql 42 5000000 none force avgt 5 648.279 ± 4.008 ms/op
SqlNestedDataBenchmark.querySql 43 5000000 none false avgt 5 504.493 ± 92.658 ms/op
SqlNestedDataBenchmark.querySql 43 5000000 none force avgt 5 383.268 ± 14.283 ms/op
```
```
44, 45 big cardinality like filter + selector filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%' AND string1 = '1000'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%' AND JSON_VALUE(nested, '$.nesteder.string1') = '1000'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 44 5000000 none false avgt 5 641.740 ± 13.625 ms/op
SqlNestedDataBenchmark.querySql 44 5000000 none force avgt 5 598.289 ± 9.617 ms/op
SqlNestedDataBenchmark.querySql 45 5000000 none false avgt 5 21.523 ± 0.254 ms/op
SqlNestedDataBenchmark.querySql 45 5000000 none force avgt 5 19.543 ± 0.231 ms/op
```
#### Release note
<!-- Give your best effort to summarize your changes in a couple of sentences aimed toward Druid users.
If your change doesn't have end user impact, you can skip this section.
For tips about how to write a good release note, see [Release notes](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#release-notes).
-->
<hr>
##### Key changed/added classes in this PR
* `ColumnConfig`
* `NestedFieldLiteralColumnIndexSupplier`
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483747734
the different test results seem to be because of different behavior of the predicate index on the bound filter vs the value matcher, the latter looks like it is matching nulls. The results now actually look more consistent with other numeric filters using predicates due to lexicographic sorting.
I think there is a bug with the bound filter, it needs to wrap the predicate index in the null matching lower bound in the same way that the range index does.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] github-code-scanning[bot] commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "github-code-scanning[bot] (via GitHub)" <gi...@apache.org>.
github-code-scanning[bot] commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1158051604
##########
processing/src/main/java/org/apache/druid/segment/AutoTypeColumnMerger.java:
##########
@@ -87,15 +86,14 @@
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
- ProgressIndicator progressIndicator,
+ @SuppressWarnings("unused") ProgressIndicator progressIndicator,
Review Comment:
## Useless parameter
The parameter 'progressIndicator' is never used.
[Show more details](https://github.com/apache/druid/security/code-scanning/4727)
##########
processing/src/main/java/org/apache/druid/segment/nested/VariantArrayColumnAndIndexSupplier.java:
##########
@@ -199,7 +207,9 @@
Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
Supplier<ColumnarInts> encodedValueColumnSupplier,
GenericIndexed<ImmutableBitmap> valueIndexes,
- @SuppressWarnings("unused") BitmapFactory bitmapFactory
+ @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+ @SuppressWarnings("unused") ColumnConfig columnConfig,
+ @SuppressWarnings("unused") int numRows
Review Comment:
## Useless parameter
The parameter 'numRows' is never used.
[Show more details](https://github.com/apache/druid/security/code-scanning/4729)
##########
processing/src/main/java/org/apache/druid/segment/nested/VariantArrayColumnAndIndexSupplier.java:
##########
@@ -199,7 +207,9 @@
Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
Supplier<ColumnarInts> encodedValueColumnSupplier,
GenericIndexed<ImmutableBitmap> valueIndexes,
- @SuppressWarnings("unused") BitmapFactory bitmapFactory
+ @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+ @SuppressWarnings("unused") ColumnConfig columnConfig,
Review Comment:
## Useless parameter
The parameter 'columnConfig' is never used.
[Show more details](https://github.com/apache/druid/security/code-scanning/4728)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483754750
>I think there is a bug with the bound filter, it needs to wrap the predicate index in the null matching lower bound in the same way that the range index does.
Actually the mismatch is between the double predicate being used for the `DruidPredicateIndex` of the nested column, while the value matcher is using the string predicate. I'm not entirely certain how to fix this yet, since the virtual column is planned with an output type of 'string', while internally the nested field column is typed 'double' and so uses the double predicate. The predicate index might need to further specify which predicate should be used instead of just supplying the factory so that I can make this better behaved.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "imply-cheddar (via GitHub)" <gi...@apache.org>.
imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148747811
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
public interface ColumnConfig
{
int columnCacheSizeBytes();
+
+ /**
+ * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+ * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+ * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+ * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+ * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+ * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
Review Comment:
I'm scared of exposing this behavior through adjusting the return of `.as()`. I think it would be better to have the columns continue to return a meaningful object for `.as()` but have the those interfaces that benefit from this be able to opt themselves out of being used for bitmaps. (Or, in some future world, return an equivalent `ValueMatcher`)
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -530,6 +533,10 @@ private ColumnHolder readNestedFieldColumn(String field)
columnBuilder.setHasMultipleValues(false)
.setHasNulls(hasNull)
.setDictionaryEncodedColumnSupplier(columnSupplier);
+
+ ColumnarInts intsColumn = ints.get();
+ final int size = intsColumn.size();
+ intsColumn.close();
Review Comment:
This type of pattern is a bit scary. In this instance given that it's when deserializing the object, it's probably fine, but we've had issues where we thought we were getting a thing and then closing it, only to find out that we were getting a thing that was cached and supposed to live for the lifetime of a query and the immediate close was closing things.
That said, I think we can make this cleaner by making the `Supplier<>` itself be a class that has a method that knows the number of rows represented by the thing that it supplies. It should be easy enough to implement this for the suppliers used in this case (the number is already on the jvm heap for those objects anyway) and this avoids any need to create an object and close it immediately.
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
public interface ColumnConfig
{
int columnCacheSizeBytes();
+
+ /**
+ * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+ * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+ * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+ * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+ * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+ * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
+ * number of bitmap operations is determined by how many individual values fall in the range, a subset of the columns
+ * total cardinality.
+ * <p>
+ * Currently only the {@link org.apache.druid.segment.nested.NestedFieldLiteralColumnIndexSupplier} implements this
+ * behavior.
+ * <p>
+ * This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
+ * {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
+ * often exceed the cost of just using doing a full scan and using a
+ * {@link org.apache.druid.query.filter.ValueMatcher}.
Review Comment:
I think the "many" in "This can make many standalone filters faster" is a bit of a hyperbola. What about "Some standalone filters speed up with this optimization in cases where the overhead of walking the dictionary and combining bitmaps to construct ..."?
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -538,7 +545,10 @@ private ColumnHolder readNestedFieldColumn(String field)
localDictionarySupplier,
stringDictionarySupplier,
longDictionarySupplier,
- doubleDictionarySupplier
+ doubleDictionarySupplier,
+ size,
+ columnConfig.skipValueRangeIndexScale(),
+ columnConfig.skipValuePredicateIndexScale()
Review Comment:
Why not just pass the `columnConfig` object itself?
##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 28, 29
"SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
- "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 30, 31
+ "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 32,33
+ "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 34,35 smaller cardinality like range filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+ // 36,37 smaller cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+ // 38-39 moderate cardinality like range
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+ // 40, 41 big cardinality lex range
+ "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+ // 42, 43 big cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",
Review Comment:
Can you add various sized IN clauses to your benchmark? 10, 100, 1000, 10000, 100000?
After reading the PR, I don't think they should be impacted, but it would be good to know that they aren't as they are an example of a thing that falls back to predicate-like behavior, but which we have seen also cause performance issues when run as a `ValueMatcher`
##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldLiteralColumnIndexSupplier.java:
##########
@@ -124,6 +131,10 @@ public <T> T as(Class<T> clazz)
return (T) new NestedLiteralDictionaryEncodedStringValueIndex();
}
+ if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
+ return null;
+ }
Review Comment:
Technically, this would indicate that this column cannot implement the given the interface, so the filter in question should fall back to another interface (i.e. it might make the filter to cascade down into trying an even worse strategy). Instead, we should return a `DruidPredicateIndex` that returns `null` when asked to compute the index, indicating that it was unable to compute the index. This will make it easier to do this "correctly" when we make the change to have this object able to also return ValueMatcher objects.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "imply-cheddar (via GitHub)" <gi...@apache.org>.
imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1158089484
##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldColumnIndexSupplier.java:
##########
@@ -119,8 +121,8 @@ public NestedFieldColumnIndexSupplier(
this.arrayElementBitmaps = arrayElementBitmaps;
this.adjustLongId = globalStringDictionarySupplier.get().size();
this.adjustDoubleId = adjustLongId + globalLongDictionarySupplier.get().size();
- this.skipRangeIndexThreshold = (int) Math.ceil(skipValueRangeIndexScale * numRows);
- this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(skipValuePredicateIndexScale * numRows);
+ this.skipRangeIndexThreshold = (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
+ this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
Review Comment:
I think it's a bit nicer to make these methods instead of fields that are computed on every constructor. The computation that's being done is minimal, that's true, but it's better to only do it when needed. I think that if you just store the reference to the `ColumnConfig` you can pretty easily move this check into private methods. All of the places where you would check for skipping the predicate index, you would probably also be grabbing a `localDictionarySupplier` anyway, so you can reuse that singular reference too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148822590
##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 28, 29
"SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
- "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 30, 31
+ "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 32,33
+ "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 34,35 smaller cardinality like range filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+ // 36,37 smaller cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+ // 38-39 moderate cardinality like range
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+ // 40, 41 big cardinality lex range
+ "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+ // 42, 43 big cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",
Review Comment:
`InDimFilter` typically only uses `Utf8ValueSetIndex` and `StringValueSetIndex` which cannot be impacted by the changes currently in this PR, _unless_ it is using an `ExtractionFn` which causes it to use a `DruidPredicateIndex`. I need to figure out how to get something that would plan to something with an extractionFn in here to see how this PR would impact this filter.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483742354
I decided to do the `DruidPredicateIndex` threshold based on the overall value cardinality instead of evaluating each predicate on the assumption that the predicates themselves are not that cheap. However, there might be room for exploration of evaluating the predicates under a certain dictionary sizes to determine the actual number of bitmap operations like we are doing for range indexes (well the `LexicographicalRangeIndex` for strings has a predicate form that further reduces the actual number of bitmaps, but that they operate on the full range before the predicate).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148825560
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
public interface ColumnConfig
{
int columnCacheSizeBytes();
+
+ /**
+ * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+ * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+ * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+ * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+ * This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+ * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
Review Comment:
This was already true of `as` itself, returning `null` has always been an indicator that the index we asked for is not available. I just made it a bit clearer in the docs and also leaned into allowing the things `as` returns to also return null to indicate that an index is not available.
For the types of short circuits done in this PR, i feel like the index supplier has better information to make the decision than anything else above it since it has direct access to the size of dictionary, number of rows to be scanned, etc, but I will think on it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] imply-cheddar commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "imply-cheddar (via GitHub)" <gi...@apache.org>.
imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1159213704
##########
processing/src/main/java/org/apache/druid/segment/column/DruidPredicateIndex.java:
##########
@@ -36,4 +36,12 @@
*/
@Nullable
BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory);
+
+ static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int dictionaryCardinality)
Review Comment:
`shouldSkip`? `checkSkipThreshold` doesn't really tell me how I'm supposed to interpret `true/false`. Does `true` mean that it exceeded the threshold and therefore should be skipped? Or does it mean that it did not exceed the threshold and should be used?
##########
processing/src/main/java/org/apache/druid/segment/column/NumericRangeIndex.java:
##########
@@ -43,4 +43,9 @@ BitmapColumnIndex forRange(
@Nullable Number endValue,
boolean endStrict
);
+
+ static boolean checkSkipThreshold(ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
+ {
+ return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
Review Comment:
You aren't doing the null check on this one. That intentional?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis merged pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis merged PR #13977:
URL: https://github.com/apache/druid/pull/13977
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on a diff in pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148822590
##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 28, 29
"SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
- "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 30, 31
+ "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 32,33
+ "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 34,35 smaller cardinality like range filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+ // 36,37 smaller cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+ // 38-39 moderate cardinality like range
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+ // 40, 41 big cardinality lex range
+ "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+ // 42, 43 big cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",
Review Comment:
`InDimFilter` typically only uses `Utf8ValueSetIndex` and `StringValueSetIndex` which cannot be impacted by the changes currently in this PR, _unless_ it is using an `ExtractionFn`. I need to figure out how to get something that would plan to something with an extractionFn in here to see how this PR would impact this filter.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on pull request #13977: smarter nested column index utilization
Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #13977:
URL: https://github.com/apache/druid/pull/13977#issuecomment-1483743051
Also looks like I might have some legitimate test failures, possibly due to mismatch of behavior between indexes and value matchers on some nested columns 😓 investigating
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org