Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/17 12:38:03 UTC
[GitHub] [lucene] wjp719 opened a new pull request #687: support binary search in PointValues
wjp719 opened a new pull request #687:
URL: https://github.com/apache/lucene/pull/687
Adds a common function that lets callers run a binary search over the BKD point tree.
One possible use case: for log data, when the index is sorted in ascending order by the @timestamp field and we want to run a count aggregation over many time intervals, binary search can find the min/max docId within each interval, and the doc count is then max docId - min docId + 1.
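The counting idea above can be sketched in isolation (the helper names below are hypothetical, not the Lucene API): with docIds assigned in ascending timestamp order, a lower-bound binary search finds the first doc at or after each interval boundary, and the count falls out as max docId - min docId + 1.

```java
public class IntervalCount {
    // timestamps[i] is the value indexed for docId i; ascending because the
    // index is sorted by the timestamp field.
    static int countInInterval(long[] timestamps, long from, long toExclusive) {
        int minDoc = firstAtLeast(timestamps, from);            // first docId >= from
        int maxDoc = firstAtLeast(timestamps, toExclusive) - 1; // last docId < toExclusive
        return (minDoc > maxDoc) ? 0 : maxDoc - minDoc + 1;
    }

    // Standard lower-bound binary search: first index whose value >= target.
    static int firstAtLeast(long[] a, long target) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] ts = {10, 20, 20, 30, 40, 50};
        // docs with 20 <= timestamp < 41 are docIds 1..4
        System.out.println(countInInterval(ts, 20, 41)); // 4
    }
}
```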
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching for the min/max docId with doc values.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc; every probe of a doc-values binary search therefore needs a new **SortedNumericDocValues** instance advanced from the first doc, which costs more CPU and IO.
I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to mark whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether a doc exists.
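The forward-only constraint described above can be illustrated with a minimal sketch (`ForwardOnlyValues` and its methods are hypothetical stand-ins, not the real `SortedNumericDocValues` API): because a binary-search probe may land before the iterator's current position, each probe conservatively starts from a fresh iterator.

```java
import java.util.function.Supplier;

public class ForwardOnlySearch {
    // Minimal stand-in for a forward-only doc-values iterator: it refuses to
    // move backwards, just like a real doc-values iterator cannot.
    static final class ForwardOnlyValues {
        private final long[] values;
        private int pos = -1;
        ForwardOnlyValues(long[] values) { this.values = values; }
        long advanceTo(int docId) {
            if (docId < pos) throw new IllegalStateException("cannot go backwards");
            pos = docId;
            return values[docId];
        }
    }

    // Binary search for the first docId whose value >= target. Because probes
    // can jump backwards, each probe needs a brand-new iterator that advances
    // from doc 0 -- the repeated cost that a BKD-tree search avoids.
    static int firstAtLeast(Supplier<ForwardOnlyValues> newIterator, int maxDoc, long target) {
        int lo = 0, hi = maxDoc;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            long v = newIterator.get().advanceTo(mid); // fresh iterator per probe
            if (v < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] vals = {1, 3, 3, 7, 9};
        System.out.println(firstAtLeast(() -> new ForwardOnlyValues(vals), vals.length, 5)); // 3
    }
}
```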
### benchmark result
I also compared this PR's performance against the main branch:
#### dataset
I used two datasets. The small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:
```
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "1998-06-08T05:00:01Z",
"lt": "1998-06-15T00:00:00Z"
}
}
},
{
"match": {
"status": "200"
}
}
]
}
}
```
#### result
1. With the es rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
| 50th percentile service time | 200s-in-range | 97.5456 | 97.374 | -0.17156 | ms |
| 90th percentile service time | 200s-in-range | 99.0628 | 99.3968 | 0.33391 | ms |
| 99th percentile service time | 200s-in-range | 103.838 | 101.961 | -1.87675 | ms |
| 100th percentile service time | 200s-in-range | 104.579 | 120.306 | 15.7267 | ms |
| error rate | 200s-in-range | 0 | 0 | 0 | % |
| Min Throughput | 400s-in-range | 49.8772 | 49.9738 | 0.09658 | ops/s |
| Mean Throughput | 400s-in-range | 49.8831 | 49.9751 | 0.09204 | ops/s |
| Median Throughput | 400s-in-range | 49.8831 | 49.9751 | 0.09204 | ops/s |
| Max Throughput | 400s-in-range | 49.889 | 49.9765 | 0.0875 | ops/s |
| 50th percentile latency | 400s-in-range | 4.90736 | 4.90565 | -0.00171 | ms |
| 90th percentile latency | 400s-in-range | 5.36891 | 5.35761 | -0.01129 | ms |
| 99th percentile latency | 400s-in-range | 5.65861 | 5.62417 | -0.03444 | ms |
| 100th percentile latency | 400s-in-range | 6.1086 | 5.6826 | -0.42601 | ms |
| 50th percentile service time | 400s-in-range | 4.23889 | 4.19736 | -0.04152 | ms |
| 90th percentile service time | 400s-in-range | 4.59162 | 4.41089 | -0.18074 | ms |
| 99th percentile service time | 400s-in-range | 4.75771 | 4.75615 | -0.00156 | ms |
| 100th percentile service time | 400s-in-range | 5.13857 | 4.87359 | -0.26499 | ms |
```
2. Manual single run (all caches cleared):
```
| dataSet|main branch latency|this pr latency|latency improvement|
| httpLog| 267ms| 167ms| -38%|
|our application log| 2829ms| 1093ms| -62%|
```
[GitHub] [lucene] iverase commented on a change in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
iverase commented on a change in pull request #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r815921747
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -181,12 +189,143 @@ public int count(LeafReaderContext context) throws IOException {
};
}
+ /**
+ * Returns the first document whose packed value is greater than (or equal to, if allowEqual is
+ * true) the provided packed value, or -1 if all packed values are smaller than the provided one.
+ */
+ public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual) throws IOException {
+ final int numIndexDimensions = values.getNumIndexDimensions();
+ final int bytesPerDim = values.getBytesPerDimension();
+ final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+ final Predicate<byte[]> biggerThan = testPackedValue -> {
+ for (int dim = 0; dim < numIndexDimensions; dim++) {
+ final int offset = dim * bytesPerDim;
+ if (allowEqual) {
+ if (comparator.compare(testPackedValue, offset, packedValue, offset) < 0) {
+ return false;
+ }
+ } else {
+ if (comparator.compare(testPackedValue, offset, packedValue, offset) <= 0) {
+ return false;
+ }
+ }
+ }
+ return true;
+ };
+ return nextDoc(values.getPointTree(), biggerThan);
+ }
+
+ private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan) throws IOException {
+ if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+ // doc is before us
+ return -1;
+ } else if (pointTree.moveToChild()) {
+ // navigate down
+ do {
+ final int doc = nextDoc(pointTree, biggerThan);
+ if (doc != -1) {
+ return doc;
+ }
+ } while (pointTree.moveToSibling());
+ pointTree.moveToParent();
+ return -1;
+ } else {
+ // doc is in this leaf
+ final int[] doc = { -1 };
+ pointTree.visitDocValues(new IntersectVisitor() {
+ @Override
+ public void visit(int docID) {
+ throw new AssertionError("Invalid call to visit(docID)");
+ }
+
+ @Override
+ public void visit(int docID, byte[] packedValue) {
+ if (doc[0] == -1 && biggerThan.test(packedValue)) {
+ doc[0] = docID;
+ }
+ }
+
+ @Override
+ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+ return Relation.CELL_CROSSES_QUERY;
+ }
+ });
+ return doc[0];
+ }
+ }
+
+ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint) throws IOException {
+ final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+ for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+ int offset = dim * points.getBytesPerDimension();
+ if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+ return false;
+ }
+ if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+ return false;
+ }
+ if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+ || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ private BoundedDocSetIdIterator getDocIdSetIteratorOrNullFromBkd(LeafReaderContext context, DocIdSetIterator delegate)
+ throws IOException {
+ Sort indexSort = context.reader().getMetaData().getSort();
+ if (indexSort != null
+ && indexSort.getSort().length > 0
+ && indexSort.getSort()[0].getField().equals(field)
+ && !indexSort.getSort()[0].getReverse()) {
Review comment:
We prefer explicitly comparing to false rather than using the `!` operator, for readability.
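For illustration, the requested convention looks like this (hypothetical variable and method names):

```java
public class StyleExample {
    static String order(boolean reverse) {
        // Preferred in Lucene reviews: an explicit comparison to false,
        // rather than the easy-to-miss negation: if (!reverse) { ... }
        if (reverse == false) {
            return "ascending";
        }
        return "descending";
    }

    public static void main(String[] args) {
        System.out.println(order(false)); // ascending
    }
}
```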
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -308,8 +449,10 @@ public int advance(int target) throws IOException {
if (target < firstDoc) {
target = firstDoc;
}
-
- int result = delegate.advance(target);
+ int result = target;
+ if(!allDocExist) {
Review comment:
We prefer explicitly comparing to false rather than using the `!` operator, for readability.
##########
File path: lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestIndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -483,6 +483,65 @@ public void testCount() throws IOException {
dir.close();
}
+ public void testCountWithBkd() throws IOException {
Review comment:
I think we should add a stronger randomized test case to cover as many cases as possible.
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching for the min/max docId with doc values.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so it may need to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs more CPU and IO.
I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to mark whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also compared this PR's performance against the main branch:
#### dataset
I used two datasets. The small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:
```
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "1998-06-08T05:00:01Z",
"lt": "1998-06-15T00:00:00Z"
}
}
},
{
"match": {
"status": "200"
}
}
]
}
}
```
#### result
1. With the es rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.54473 | 10.1261 | 0.58137 | ops/s |
| Mean Throughput | 200s-in-range | 9.58063 | 10.14 | 0.55941 | ops/s |
| Median Throughput | 200s-in-range | 9.5815 | 10.1405 | 0.55902 | ops/s |
| Max Throughput | 200s-in-range | 9.61395 | 10.1524 | 0.53847 | ops/s |
| 50th percentile latency | 200s-in-range | 40581.6 | 37628.2 | -2953.41 | ms |
| 90th percentile latency | 200s-in-range | 43334.5 | 40271.4 | -3063.11 | ms |
| 99th percentile latency | 200s-in-range | 43949.5 | 40868.3 | -3081.2 | ms |
| 100th percentile latency | 200s-in-range | 44016.2 | 40933.7 | -3082.54 | ms |
| 50th percentile service time | 200s-in-range | 98.6711 | 96.5836 | -2.0875 | ms |
| 90th percentile service time | 200s-in-range | 100.634 | 97.772 | -2.86207 | ms |
| 99th percentile service time | 200s-in-range | 121.701 | 99.2353 | -22.4661 | ms |
| 100th percentile service time | 200s-in-range | 127.223 | 99.4463 | -27.7768 | ms |
```
2. Manual single run (all caches cleared):
```
| dataSet|main branch latency|this pr latency|latency improvement|
| httpLog| 267ms| 167ms| -38%|
|our application log| 2829ms| 1093ms| -62%|
```
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1056088019
@jpountz @iverase `./gradlew check` now passes on my local machine; please trigger the checks again.
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1055191448
@iverase I have simplified the dimensions to 1; please review again.
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1057611824
> One thing I'm not totally happy with is that the change assumes that the field is either not indexed or indexed with `LongPoint` while this query would previously also work with `IntPoint`. Things like this would be easier to do once we know more about both how the field is indexed and how doc values have been created, e.g. via [LUCENE-10162](https://issues.apache.org/jira/browse/LUCENE-10162).
Hi @jpountz, what else should I do for this PR?
[GitHub] [lucene] iverase commented on a change in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
iverase commented on a change in pull request #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r816535481
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -181,12 +189,159 @@ public int count(LeafReaderContext context) throws IOException {
};
}
+ /**
+ * Returns the first document whose packed value is greater than (or equal to, if allowEqual is
+ * true) the provided packed value, or -1 if all packed values are smaller than the provided one.
+ */
+ public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual) throws IOException {
+ final int numIndexDimensions = values.getNumIndexDimensions();
+ final int bytesPerDim = values.getBytesPerDimension();
+ final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+ final Predicate<byte[]> biggerThan = testPackedValue -> {
+ for (int dim = 0; dim < numIndexDimensions; dim++) {
+ final int offset = dim * bytesPerDim;
Review comment:
The tests always seem to use one dimension, and I think this query can only be used in the 1D case. Can we simplify here and everywhere we loop through dimensions?
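The 1D simplification being asked for might look roughly like this (a sketch with hypothetical helper names, not the actual patch): with a single index dimension, the per-dimension loop collapses to one unsigned byte-array comparison at offset 0.

```java
import java.util.function.Predicate;

public class OneDimPredicate {
    // Hypothetical 1D version of the biggerThan predicate: no dim loop,
    // just one unsigned comparison of the whole packed value.
    static Predicate<byte[]> biggerThan(byte[] packedValue, boolean allowEqual) {
        return testPackedValue -> {
            int cmp = compareUnsigned(testPackedValue, packedValue);
            return allowEqual ? cmp >= 0 : cmp > 0;
        };
    }

    // Stand-in for ArrayUtil.getUnsignedComparator(bytesPerDim): lexicographic
    // comparison of equal-length byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] target = {0x00, 0x10};
        System.out.println(biggerThan(target, true).test(new byte[]{0x00, 0x10}));  // true
        System.out.println(biggerThan(target, false).test(new byte[]{0x00, 0x10})); // false
    }
}
```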
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching for the min/max docId with doc values.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc; every probe of a doc-values binary search therefore needs a new **SortedNumericDocValues** instance advanced from the first doc, which costs more CPU and IO.
I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to mark whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also compared this PR's performance against the main branch:
#### dataset
I used two datasets. The small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:
```
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "1998-06-08T05:00:01Z",
"lt": "1998-06-15T00:00:00Z"
}
}
},
{
"match": {
"status": "200"
}
}
]
}
}
```
#### result
1. With the es rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
```
2. Manual single run (all caches cleared):
```
| dataSet|main branch latency|this pr latency|latency improvement|
| httpLog| 267ms| 167ms| -38%|
|our application log| 2829ms| 1093ms| -62%|
```
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1081341226
Hi @jpountz @iverase, is this PR OK now?
[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#count using bkd binary search
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1048791215
This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster?
It's unclear to me whether this PR buys us much.
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching for the min/max docId with doc values.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so it may need to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs more CPU and IO.
I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to mark whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also compared this PR's performance against the main branch:
#### dataset
I used two datasets. The small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:
```
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "1998-06-08T05:00:01Z",
"lt": "1998-06-15T00:00:00Z"
}
}
},
{
"match": {
"status": "200"
}
}
]
}
}
```
#### result
1. With the es rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 13.0162 | 3.0894 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 13.0482 | 3.10265 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 13.0526 | 3.10708 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 13.0712 | 3.10722 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 25504.9 | -13159.8 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 27291.3 | -14058.5 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 27681.4 | -14272.8 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 27723.8 | -14297.8 | ms |
| 50th percentile service time | 200s-in-range | 97.5456 | 73.0836 | -24.462 | ms |
| 90th percentile service time | 200s-in-range | 99.0628 | 74.586 | -24.4768 | ms |
| 99th percentile service time | 200s-in-range | 103.838 | 91.0001 | -12.8379 | ms |
| 100th percentile service time | 200s-in-range | 104.579 | 120.735 | 16.1559 | ms |
```
2. Manual single run (all caches cleared):
```
| dataSet|main branch latency|this pr latency|latency improvement|
| httpLog| 267ms| 167ms| -38%|
|our application log| 2829ms| 1093ms| -62%|
```
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1054244294
Hi @iverase, I moved the BKD binary search into IndexSortSortedNumericDocValuesRangeQuery as you suggested; please help review it, thanks.
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching for the min/max docId with doc values.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc; every probe of a doc-values binary search therefore needs a new **SortedNumericDocValues** instance advanced from the first doc, which costs more CPU and IO.
I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to mark whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also compared this PR's performance against the main branch:
#### dataset
I used two datasets. The small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached):
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use the bkd binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching over doc values to find the min/max docId.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so it may need to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs more CPU and IO.
I also added a variable **allDocExist** in **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; if so, the **BoundedDocSetIdIterator#advance()** method skips the **delegate.advance()** call that checks whether a doc exists.
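A rough sketch of that shortcut (hypothetical names and signature, not the PR's exact code): when all docs in the bounded range are known to exist, advance() can return the clamped target without consulting the delegate.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the BoundedDocSetIdIterator#advance() shortcut:
// when every doc in [firstDoc, lastDoc) is known to match, the
// delegate.advance() existence check can be skipped entirely.
final class BoundedAdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static int advance(
      int target, int firstDoc, int lastDoc, boolean allDocsExist, IntUnaryOperator delegateAdvance) {
    if (target < firstDoc) {
      target = firstDoc; // clamp into the bounded range
    }
    // Skip the potentially costly delegate.advance() when all docs exist.
    int result = allDocsExist ? target : delegateAdvance.applyAsInt(target);
    return result < lastDoc ? result : NO_MORE_DOCS;
  }
}
```

The delegate is modeled here as a plain `IntUnaryOperator`; in Lucene it would be a `DocIdSetIterator`, but the control flow is the same.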
### benchmark result
I also benchmarked this PR against the main branch:
#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause; for the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached), comparing performance on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.54473 | 10.1708 | 0.6261 | ops/s |
| Mean Throughput | 200s-in-range | 9.58063 | 10.1765 | 0.59589 | ops/s |
| Median Throughput | 200s-in-range | 9.5815 | 10.1759 | 0.59436 | ops/s |
| Max Throughput | 200s-in-range | 9.61395 | 10.1828 | 0.56884 | ops/s |
| 50th percentile latency | 200s-in-range | 40581.6 | 37436.6 | -3144.97 | ms |
| 90th percentile latency | 200s-in-range | 43334.5 | 40092.1 | -3242.37 | ms |
| 99th percentile latency | 200s-in-range | 43949.5 | 40689.7 | -3259.83 | ms |
| 100th percentile latency | 200s-in-range | 44016.2 | 40755.5 | -3260.66 | ms |
| 50th percentile service time | 200s-in-range | 98.6711 | 96.4568 | -2.21426 | ms |
| 90th percentile service time | 200s-in-range | 100.634 | 98.8512 | -1.7828 | ms |
| 99th percentile service time | 200s-in-range | 121.701 | 106.898 | -14.803 | ms |
| 100th percentile service time | 200s-in-range | 127.223 | 109.426 | -17.797 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1054244294
Hi @iverase, I moved the bkd binary search into IndexSortSortedNumericDocValuesRangeQuery as you suggested; please help review it, thanks.
[GitHub] [lucene] wjp719 commented on a change in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on a change in pull request #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r816404708
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -308,8 +449,10 @@ public int advance(int target) throws IOException {
if (target < firstDoc) {
target = firstDoc;
}
-
- int result = delegate.advance(target);
+ int result = target;
+ if(!allDocExist) {
Review comment:
done
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
##########
@@ -181,12 +189,143 @@ public int count(LeafReaderContext context) throws IOException {
};
}
+ /**
+ * Returns the first document whose packed value is greater than or equal (if allowEqual is true) to the provided packed value
+ * or -1 if all packed values are smaller than the provided one,
+ */
+ public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual) throws IOException {
+ final int numIndexDimensions = values.getNumIndexDimensions();
+ final int bytesPerDim = values.getBytesPerDimension();
+ final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+ final Predicate<byte[]> biggerThan = testPackedValue -> {
+ for (int dim = 0; dim < numIndexDimensions; dim++) {
+ final int offset = dim * bytesPerDim;
+ if (allowEqual) {
+ if (comparator.compare(testPackedValue, offset, packedValue, offset) < 0) {
+ return false;
+ }
+ } else {
+ if (comparator.compare(testPackedValue, offset, packedValue, offset) <= 0) {
+ return false;
+ }
+ }
+ }
+ return true;
+ };
+ return nextDoc(values.getPointTree(), biggerThan);
+ }
+
+ private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan) throws IOException {
+ if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+ // doc is before us
+ return -1;
+ } else if (pointTree.moveToChild()) {
+ // navigate down
+ do {
+ final int doc = nextDoc(pointTree, biggerThan);
+ if (doc != -1) {
+ return doc;
+ }
+ } while (pointTree.moveToSibling());
+ pointTree.moveToParent();
+ return -1;
+ } else {
+ // doc is in this leaf
+ final int[] doc = { -1 };
+ pointTree.visitDocValues(new IntersectVisitor() {
+ @Override
+ public void visit(int docID) {
+ throw new AssertionError("Invalid call to visit(docID)");
+ }
+
+ @Override
+ public void visit(int docID, byte[] packedValue) {
+ if (doc[0] == -1 && biggerThan.test(packedValue)) {
+ doc[0] = docID;
+ }
+ }
+
+ @Override
+ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+ return Relation.CELL_CROSSES_QUERY;
+ }
+ });
+ return doc[0];
+ }
+ }
+
+ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint) throws IOException {
+ final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+ for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+ int offset = dim * points.getBytesPerDimension();
+ if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+ return false;
+ }
+ if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+ return false;
+ }
+ if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+ || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ private BoundedDocSetIdIterator getDocIdSetIteratorOrNullFromBkd(LeafReaderContext context, DocIdSetIterator delegate)
+ throws IOException {
+ Sort indexSort = context.reader().getMetaData().getSort();
+ if (indexSort != null
+ && indexSort.getSort().length > 0
+ && indexSort.getSort()[0].getField().equals(field)
+ && !indexSort.getSort()[0].getReverse()) {
Review comment:
done
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1068112283
Hi @jpountz, please take some time to review this PR again, thanks.
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#count using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use the bkd binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching over doc values to find the min/max docId.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so every probe of a doc-values binary search has to create a new **SortedNumericDocValues** instance and advance from the first doc, which costs more CPU and IO.
### benchmark result
I also benchmarked this PR against the main branch:
#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause; for the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached):
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use the bkd binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching over doc values to find the min/max docId.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so it may need to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs more CPU and IO.
I also added a variable **allDocExist** in **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; if so, the **BoundedDocSetIdIterator#advance()** method skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also benchmarked this PR against the main branch:
#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause; for the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached), comparing performance on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
| 50th percentile service time | 200s-in-range | 97.5456 | 97.374 | -0.17156 | ms |
| 90th percentile service time | 200s-in-range | 99.0628 | 99.3968 | 0.33391 | ms |
| 99th percentile service time | 200s-in-range | 103.838 | 101.961 | -1.87675 | ms |
| 100th percentile service time | 200s-in-range | 104.579 | 120.306 | 15.7267 | ms |
| error rate | 200s-in-range | 0 | 0 | 0 | % |
| Min Throughput | 400s-in-range | 49.8772 | 49.9738 | 0.09658 | ops/s |
| Mean Throughput | 400s-in-range | 49.8831 | 49.9751 | 0.09204 | ops/s |
| Median Throughput | 400s-in-range | 49.8831 | 49.9751 | 0.09204 | ops/s |
| Max Throughput | 400s-in-range | 49.889 | 49.9765 | 0.0875 | ops/s |
| 50th percentile latency | 400s-in-range | 4.90736 | 4.90565 | -0.00171 | ms |
| 90th percentile latency | 400s-in-range | 5.36891 | 5.35761 | -0.01129 | ms |
| 99th percentile latency | 400s-in-range | 5.65861 | 5.62417 | -0.03444 | ms |
| 100th percentile latency | 400s-in-range | 6.1086 | 5.6826 | -0.42601 | ms |
| 50th percentile service time | 400s-in-range | 4.23889 | 4.19736 | -0.04152 | ms |
| 90th percentile service time | 400s-in-range | 4.59162 | 4.41089 | -0.18074 | ms |
| 99th percentile service time | 400s-in-range | 4.75771 | 4.75615 | -0.00156 | ms |
| 100th percentile service time | 400s-in-range | 5.13857 | 4.87359 | -0.26499 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use the bkd binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching over doc values to find the min/max docId.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so it may need to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs more CPU and IO.
I also added a variable **allDocExist** in **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; if so, the **BoundedDocSetIdIterator#advance()** method skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also benchmarked this PR against the main branch:
#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause; for the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached), comparing performance on the httpLog small dataset:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.54473 | 13.0162 | 3.47149 | ops/s |
| Mean Throughput | 200s-in-range | 9.58063 | 13.0482 | 3.46758 | ops/s |
| Median Throughput | 200s-in-range | 9.5815 | 13.0526 | 3.47114 | ops/s |
| Max Throughput | 200s-in-range | 9.61395 | 13.0712 | 3.45725 | ops/s |
| 50th percentile latency | 200s-in-range | 40581.6 | 25504.9 | -15076.7 | ms |
| 90th percentile latency | 200s-in-range | 43334.5 | 27291.3 | -16043.2 | ms |
| 99th percentile latency | 200s-in-range | 43949.5 | 27681.4 | -16268.1 | ms |
| 100th percentile latency | 200s-in-range | 44016.2 | 27723.8 | -16292.4 | ms |
| 50th percentile service time | 200s-in-range | 98.6711 | 73.0836 | -25.5875 | ms |
| 90th percentile service time | 200s-in-range | 100.634 | 74.586 | -26.048 | ms |
| 99th percentile service time | 200s-in-range | 121.701 | 91.0001 | -30.7012 | ms |
| 100th percentile service time | 200s-in-range | 127.223 | 120.735 | -6.48813 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1054926857
@iverase I added a random test, please review it again.
[GitHub] [lucene] iverase commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
iverase commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053981720
I like this idea, but I agree with Adrien that the API change does not look right. Moreover, I don't think we need to add it to PointValues, as this is something specific to index sorting and it won't apply in all situations. I think it makes little sense when we use the kd-tree as an effective r-tree, as with range fields.
I think it is possible to add a specific implementation on `IndexSortSortedNumericDocValuesRangeQuery` that does not require an API change. I came up with this implementation; something close to it should work:
```
/**
 * Returns the first document whose packed value is greater than or equal to the provided packed
 * value, or -1 if all packed values are smaller than the provided one.
 */
public final int nextDoc(PointValues values, byte[] packedValue) throws IOException {
  final int numIndexDimensions = values.getNumIndexDimensions();
  final int bytesPerDim = values.getBytesPerDimension();
  final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
  final Predicate<byte[]> biggerThan = testPackedValue -> {
    for (int dim = 0; dim < numIndexDimensions; dim++) {
      final int offset = dim * bytesPerDim;
      if (comparator.compare(packedValue, offset, testPackedValue, offset) <= 0) {
        return false;
      }
    }
    return true;
  };
  return nextDoc(values.getPointTree(), biggerThan);
}

private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan) throws IOException {
  if (biggerThan.test(pointTree.getMaxPackedValue())) {
    // doc is before us
    return -1;
  } else if (pointTree.moveToChild()) {
    // navigate down
    do {
      final int doc = nextDoc(pointTree, biggerThan);
      if (doc != -1) {
        return doc;
      }
    } while (pointTree.moveToSibling());
    pointTree.moveToParent();
    return -1;
  } else {
    // doc is in this leaf
    final int[] doc = {-1};
    pointTree.visitDocValues(new IntersectVisitor() {
      @Override
      public void visit(int docID) {
        throw new AssertionError("Invalid call to visit(docID)");
      }

      @Override
      public void visit(int docID, byte[] packedValue) {
        if (doc[0] == -1 && biggerThan.test(packedValue) == false) {
          doc[0] = docID;
        }
      }

      @Override
      public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
        return Relation.CELL_CROSSES_QUERY;
      }
    });
    assert doc[0] != -1;
    return doc[0];
  }
}
```
[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1055460224
One thing I'm not totally happy with is that the change assumes that the field is either not indexed or indexed with `LongPoint`, while this query would previously also work with `IntPoint`. Things like this would be easier to do once we know more about both how the field is indexed and how doc values have been created, e.g. via [LUCENE-10162](https://issues.apache.org/jira/browse/LUCENE-10162).
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
Posted by GitBox <gi...@apache.org>.
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
Hi @jpountz, I refactored the code. Now, when the conditions are met, I use the bkd binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary searching over doc values to find the min/max docId.
As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so every probe of a doc-values binary search has to create a new **SortedNumericDocValues** instance and advance from the first doc, which costs more CPU and IO.
I also added a variable **allDocExist** in **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; if so, the **BoundedDocSetIdIterator#advance()** method skips the **delegate.advance()** call that checks whether a doc exists.
### benchmark result
I also benchmarked this PR against the main branch:
#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs;
the big one is our application log with 1.4 billion docs.
#### query
The query is a boolean query with a range clause and a term clause; for the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```
#### result
1. With the ES Rally tool (it runs many times, so the disk data is cached):
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | 200s-in-range | 9.92683 | 10.0551 | 0.12825 | ops/s |
| Mean Throughput | 200s-in-range | 9.94556 | 10.0642 | 0.11868 | ops/s |
| Median Throughput | 200s-in-range | 9.94556 | 10.0633 | 0.1177 | ops/s |
| Max Throughput | 200s-in-range | 9.96398 | 10.0737 | 0.10974 | ops/s |
| 50th percentile latency | 200s-in-range | 38664.7 | 38022.7 | -641.967 | ms |
| 90th percentile latency | 200s-in-range | 41349.8 | 40704 | -645.858 | ms |
| 99th percentile latency | 200s-in-range | 41954.2 | 41308.7 | -645.491 | ms |
| 100th percentile latency | 200s-in-range | 42021.6 | 41377.6 | -643.989 | ms |
```
2. Manually run once (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267 ms              | 167 ms          | -38%                |
| our application log | 2829 ms             | 1093 ms         | -62%                |
```