You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/21 11:25:38 UTC
[GitHub] [druid] rohangarg opened a new pull request, #13133: Composite approach for checking in-filter values set in column dictionary
rohangarg opened a new pull request, #13133:
URL: https://github.com/apache/druid/pull/13133
Currently, for checking the in-filter values in a column dictionary we use a binary search per value in the set. That works well for smaller value-sets but starts slowing down as the number of values in the set increase. To accommodate for large value-sets arising from large in-filters or from joins being pushed down as in-filters, we use sorted merge algorithm for merging the set and dictionary for larger values.
The following benchmark was run to find the cutoff point :
```
Benchmark (dictionarySize) (filterToDictionaryPercentage) (selectivityPercentage) Mode Cnt Score Error Units
binarySearch 1000000 1 10 avgt 10 11271.575 ± 621.620 us/op
binarySearch 1000000 1 100 avgt 10 16751.127 ± 580.550 us/op
binarySearch 1000000 2 10 avgt 10 25387.245 ± 1795.582 us/op
binarySearch 1000000 2 100 avgt 10 19707.334 ± 973.384 us/op
binarySearch 1000000 5 10 avgt 10 37256.779 ± 2287.203 us/op
binarySearch 1000000 5 100 avgt 10 47391.511 ± 1689.971 us/op
binarySearch 1000000 10 10 avgt 10 76204.615 ± 6515.056 us/op
binarySearch 1000000 10 100 avgt 10 71483.416 ± 7197.376 us/op
binarySearch 1000000 12 10 avgt 10 139481.091 ± 15513.357 us/op
binarySearch 1000000 12 100 avgt 10 142881.846 ± 9511.367 us/op
binarySearch 1000000 15 10 avgt 10 113786.273 ± 4123.158 us/op
binarySearch 1000000 15 100 avgt 10 165300.278 ± 10555.479 us/op
binarySearch 1000000 20 10 avgt 10 138410.942 ± 14330.367 us/op
binarySearch 1000000 20 100 avgt 10 137543.621 ± 9273.845 us/op
binarySearch 1000000 30 10 avgt 10 206512.608 ± 13497.954 us/op
binarySearch 1000000 30 100 avgt 10 305317.908 ± 12452.686 us/op
binarySearch 1000000 50 10 avgt 10 328867.893 ± 14594.249 us/op
binarySearch 1000000 50 100 avgt 10 332823.291 ± 26349.158 us/op
binarySearch 1000000 100 10 avgt 10 668720.906 ± 49193.818 us/op
binarySearch 1000000 100 100 avgt 10 1014546.405 ± 30709.670 us/op
sortedMerge 1000000 1 10 avgt 10 47220.743 ± 2598.755 us/op
sortedMerge 1000000 1 100 avgt 10 53634.485 ± 2886.296 us/op
sortedMerge 1000000 2 10 avgt 10 51201.356 ± 2745.801 us/op
sortedMerge 1000000 2 100 avgt 10 53046.058 ± 4500.987 us/op
sortedMerge 1000000 5 10 avgt 10 58501.742 ± 7320.870 us/op
sortedMerge 1000000 5 100 avgt 10 65597.519 ± 7356.548 us/op
sortedMerge 1000000 10 10 avgt 10 75347.468 ± 9417.556 us/op
sortedMerge 1000000 10 100 avgt 10 74601.584 ± 5251.078 us/op
sortedMerge 1000000 12 10 avgt 10 64838.734 ± 2644.738 us/op
sortedMerge 1000000 12 100 avgt 10 80342.737 ± 5414.306 us/op
sortedMerge 1000000 15 10 avgt 10 83345.836 ± 6034.488 us/op
sortedMerge 1000000 15 100 avgt 10 78405.299 ± 1651.375 us/op
sortedMerge 1000000 20 10 avgt 10 111307.577 ± 10924.456 us/op
sortedMerge 1000000 20 100 avgt 10 89371.173 ± 10347.814 us/op
sortedMerge 1000000 30 10 avgt 10 116740.355 ± 6003.945 us/op
sortedMerge 1000000 30 100 avgt 10 117312.763 ± 6325.076 us/op
sortedMerge 1000000 50 10 avgt 10 132145.411 ± 23121.259 us/op
sortedMerge 1000000 50 100 avgt 10 192802.722 ± 39032.876 us/op
sortedMerge 1000000 100 10 avgt 10 216079.634 ± 28725.333 us/op
sortedMerge 1000000 100 100 avgt 10 236561.476 ± 14369.673 us/op
```
This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] FrankChen021 commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r977120952
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -175,6 +177,8 @@ public String getValue(int index)
extends BaseGenericIndexedDictionaryEncodedIndex<ByteBuffer> implements StringValueSetIndex, Utf8ValueSetIndex
{
private static final int SIZE_WORTH_CHECKING_MIN = 8;
+ private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
Review Comment:
Could you add some javadoc to explain why the default threshold is 0.12?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg merged pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg merged PR #13133:
URL: https://github.com/apache/druid/pull/13133
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] kfaraz commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r976604210
##########
processing/src/main/java/org/apache/druid/segment/data/GenericIndexed.java:
##########
@@ -826,4 +862,28 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}
};
}
+
+ public class ValueWithIndex
Review Comment:
I guess it would be cleaner to just use a `ListIterator` which provides `nextIndex()`.
You wouldn't be able to peek the next index though, and you might have to work around that.
(That could be easier to do if we go with @FrankChen021 's suggestion to separate the two kinds of
searches into two different iterables.)
Another alternative could be to just use `Pair` but I am not a fan of it.
If you do decide to use this class, however, I would suggest putting as a top level class in `druid-core/org.apache.druid.java.util.common`, as other parts of the code might have similar requirements.
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -280,15 +287,35 @@ public ImmutableBitmap next()
private void findNext()
{
- while (next < 0 && iterator.hasNext()) {
- ByteBuffer nextValue = iterator.next();
- next = dictionary.indexOf(nextValue);
-
- if (next == -dictionarySize - 1) {
- // nextValue is past the end of the dictionary.
- // Note: we can rely on indexOf returning (-(insertion point) - 1), even though Indexed doesn't
- // guarantee it, because "dictionary" comes from GenericIndexed singleThreaded().
- break;
+ // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+ // based lookup per value. The algorithm works well for smaller number of values.
+ if (size < SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
Review Comment:
Yes, that would be much more readable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] cheddar commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
cheddar commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r992881263
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -280,8 +351,10 @@ public ImmutableBitmap next()
private void findNext()
{
- while (next < 0 && iterator.hasNext()) {
- ByteBuffer nextValue = iterator.next();
+ // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+ // based lookup per value. The algorithm works well for smaller number of values.
Review Comment:
I *think* this comment is a bit stale now as the choice of which algorithm to use happens earlier in the code than this line.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r979218453
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -175,6 +177,8 @@ public String getValue(int index)
extends BaseGenericIndexedDictionaryEncodedIndex<ByteBuffer> implements StringValueSetIndex, Utf8ValueSetIndex
{
private static final int SIZE_WORTH_CHECKING_MIN = 8;
+ private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
Review Comment:
We could link the issue that has the benchmarks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on PR #13133:
URL: https://github.com/apache/druid/pull/13133#issuecomment-1271323630
> @rohangarg - thanks for putting these benchmarks. In terms of query latencies, what difference have you observed?
I have not tried benchmarking queries as of now since the amount of difference would depend on the data and the type of query being used. The benchmark added can measure the improvement with in-filter operation independently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] FrankChen021 commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r977120670
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -280,15 +287,35 @@ public ImmutableBitmap next()
private void findNext()
{
- while (next < 0 && iterator.hasNext()) {
- ByteBuffer nextValue = iterator.next();
- next = dictionary.indexOf(nextValue);
-
- if (next == -dictionarySize - 1) {
- // nextValue is past the end of the dictionary.
- // Note: we can rely on indexOf returning (-(insertion point) - 1), even though Indexed doesn't
- // guarantee it, because "dictionary" comes from GenericIndexed singleThreaded().
- break;
+ // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+ // based lookup per value. The algorithm works well for smaller number of values.
+ if (size < SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
Review Comment:
We can determine the strategy at the point of `Iterator<ImmutableBitmap>` object is instantiated.
And it would be much better if we split the returned `Iterator<ImmutableBitmap>` into two inner classes, one is for the binary search, the other is for the sorted merge.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on PR #13133:
URL: https://github.com/apache/druid/pull/13133#issuecomment-1271320493
The force-push is done since I rebased from latest master. The new changes are a part of new commit and aren't squashed into the old ones.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r994235482
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -280,8 +351,10 @@ public ImmutableBitmap next()
private void findNext()
{
- while (next < 0 && iterator.hasNext()) {
- ByteBuffer nextValue = iterator.next();
+ // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+ // based lookup per value. The algorithm works well for smaller number of values.
Review Comment:
thanks for the catch! :+1: moved up the comment to the correct location
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] abhishekagarwal87 commented on pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on PR #13133:
URL: https://github.com/apache/druid/pull/13133#issuecomment-1256933206
@rohangarg - thanks for putting these benchmarks. In terms of query latencies, what difference have you observed?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r989862742
##########
processing/src/main/java/org/apache/druid/segment/data/GenericIndexed.java:
##########
@@ -826,4 +862,28 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}
};
}
+
+ public class ValueWithIndex
Review Comment:
removed the iterator itself, so not needed anymore.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r989861633
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -280,15 +287,35 @@ public ImmutableBitmap next()
private void findNext()
{
- while (next < 0 && iterator.hasNext()) {
- ByteBuffer nextValue = iterator.next();
- next = dictionary.indexOf(nextValue);
-
- if (next == -dictionarySize - 1) {
- // nextValue is past the end of the dictionary.
- // Note: we can rely on indexOf returning (-(insertion point) - 1), even though Indexed doesn't
- // guarantee it, because "dictionary" comes from GenericIndexed singleThreaded().
- break;
+ // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+ // based lookup per value. The algorithm works well for smaller number of values.
+ if (size < SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
Review Comment:
Makes sense, done. Had done it earlier like that, but changed it last moment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] rohangarg commented on a diff in pull request #13133: Composite approach for checking in-filter values set in column dictionary
Posted by GitBox <gi...@apache.org>.
rohangarg commented on code in PR #13133:
URL: https://github.com/apache/druid/pull/13133#discussion_r989862506
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedStringIndexSupplier.java:
##########
@@ -175,6 +177,8 @@ public String getValue(int index)
extends BaseGenericIndexedDictionaryEncodedIndex<ByteBuffer> implements StringValueSetIndex, Utf8ValueSetIndex
{
private static final int SIZE_WORTH_CHECKING_MIN = 8;
+ private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
Review Comment:
Added javadoc explanation for the threshold
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org