You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "jpountz (via GitHub)" <gi...@apache.org> on 2023/10/12 19:09:15 UTC
[I] Enable recursive graph bisection out of the box? [lucene]
jpountz opened a new issue, #12665:
URL: https://github.com/apache/lucene/issues/12665
### Description
It would be nice to enable recursive graph bisection out of the box, so that users don't even have to know that it exists or what it is to enjoy its search-time performance benefits.
- [x] #12489
- [x] #12622
- [ ] Record segments that contains blocks of documents (`IndexWriter#addDocuments` with an `s`)
- [ ] Figure out lightweight parameters of recursive graph bisection that would make the merging overhead reasonable
- [ ] Enable recursive graph bisection automatically when no index sort is configured and only `addDocument` is used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771393466
I was just checking out a profile, and with this lightweight BP configuration, we end up spending more time on building the forward index (essentially calling `OfflineSorter` on all postings) and merging (because `SortingCodecReader` introduces overhead as it needs to re-sort e.g. postings on the fly) than on reordering.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1774523421
These sound like great ideas!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "gf2121 (via GitHub)" <gi...@apache.org>.
gf2121 commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1774050262
> essentially calling OfflineSorter on all postings
FYI, I came up with some ideas to optimize this sort before, hoping to be helpful :)
1. If we use a stable sorter, we can only compare docIds because termIds are already in order.
2. If we take the maxDoc into consideration, we can save 1 round of reorder when `maxDoc < (1 << 24)`.
3. We may even purely use an offline version of radix sorter to sort the whole file, since all we need is just 3 or 4 times reorder based on point 1 and 2.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771129756
This would be awesome to enable by default. It would somehow disable itself if the application sets its own static index sort?
It's odd/curious that `PKLookup` got slower -- maybe the loss of locality/predictability of the primary key in `luceneutil` makes BlockTree work harder for lookups. Given the sizable gains on the other queries, we can accept some loss of indexing performance in exchange for these search-time gains, by default?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "gf2121 (via GitHub)" <gi...@apache.org>.
gf2121 commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1775324737
I initialized a PR on these ideas https://github.com/apache/lucene/pull/12712
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1770827026
I did a first indexing run on wikibigall with the following merge policy, which I tried to make as lightweight as possible:
```
BPIndexReorderer reorderer = new BPIndexReorderer();
reorderer.setMinDocFreq(16384);
reorderer.setMaxIters(3);
reorderer.setMinPartitionSize(8192);
mp = new BPReorderingMergePolicy(mp, reorderer, 131072);
```
Indexing ran in 3402170 msec vs. 2610068 msec without reordering, ie. 30% slower. (This is when running with default params, ie. maxBufferedDocs=12119, SerialMergeScheduler, LogDocMergePolicy (wrapped within BPReorderingMergePolicy), etc.) Search was noticeably faster:
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 279.70 (1.6%) 247.15 (2.8%) -11.6% ( -15% - -7%) 0.000
HighTerm 441.90 (7.8%) 414.88 (4.4%) -6.1% ( -17% - 6%) 0.002
CountOrHighMed 94.53 (16.3%) 92.77 (15.9%) -1.9% ( -29% - 36%) 0.715
CountOrHighHigh 60.96 (16.4%) 60.03 (16.4%) -1.5% ( -29% - 37%) 0.768
HighTermMonthSort 4314.65 (2.3%) 4274.31 (2.8%) -0.9% ( -5% - 4%) 0.248
Respell 64.51 (1.1%) 64.38 (1.6%) -0.2% ( -2% - 2%) 0.654
CountPhrase 3.54 (11.2%) 3.59 (8.2%) 1.2% ( -16% - 23%) 0.700
Wildcard 71.79 (2.6%) 72.66 (2.7%) 1.2% ( -3% - 6%) 0.143
Fuzzy2 89.57 (0.9%) 90.80 (1.2%) 1.4% ( 0% - 3%) 0.000
Prefix3 123.64 (3.4%) 125.34 (2.8%) 1.4% ( -4% - 7%) 0.159
CountTerm 14193.65 (3.6%) 14589.84 (2.9%) 2.8% ( -3% - 9%) 0.007
IntNRQ 289.67 (6.0%) 299.22 (5.6%) 3.3% ( -7% - 15%) 0.074
HighPhrase 5.95 (7.6%) 6.16 (9.3%) 3.6% ( -12% - 22%) 0.180
Fuzzy1 104.33 (0.9%) 108.08 (1.2%) 3.6% ( 1% - 5%) 0.000
LowPhrase 17.70 (3.3%) 18.74 (4.9%) 5.9% ( -2% - 14%) 0.000
MedTerm 533.08 (7.8%) 568.40 (4.5%) 6.6% ( -5% - 20%) 0.001
OrHighHigh 56.43 (5.8%) 60.45 (7.0%) 7.1% ( -5% - 21%) 0.000
CountAndHighMed 124.71 (3.2%) 136.61 (4.5%) 9.5% ( 1% - 17%) 0.000
OrHighMed 212.88 (4.0%) 233.35 (5.0%) 9.6% ( 0% - 19%) 0.000
OrHighLow 604.12 (2.8%) 676.18 (4.5%) 11.9% ( 4% - 19%) 0.000
AndHighLow 933.07 (2.3%) 1046.85 (2.7%) 12.2% ( 7% - 17%) 0.000
LowTerm 947.45 (6.1%) 1091.11 (4.7%) 15.2% ( 4% - 27%) 0.000
AndHighMed 197.62 (3.1%) 232.11 (3.1%) 17.5% ( 10% - 24%) 0.000
MedPhrase 42.93 (2.7%) 50.47 (5.4%) 17.6% ( 9% - 26%) 0.000
AndHighHigh 52.74 (4.0%) 63.41 (4.6%) 20.2% ( 11% - 30%) 0.000
CountAndHighHigh 41.69 (3.5%) 50.60 (5.6%) 21.4% ( 11% - 31%) 0.000
HighTermDayOfYearSort 444.76 (1.7%) 652.11 (2.0%) 46.6% ( 42% - 51%) 0.000
```
I'll look into whether I can reduce the merge-time overhead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [I] Enable recursive graph bisection out of the box? [lucene]
Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771171862
> It would somehow disable itself if the application sets its own static index sort?
This is correct. This bit already works on the PR, IndexWriter doesn't check the new method that `OneMerge` got that allows reordering the resulting segment if an index sort is configured in the `IndexWriterConfig`.
> It's odd/curious that PKLookup got slower
The terms dictionary should be exactly the same, only doc IDs get renumbered. BlockTree optimizes for incremental IDs by storing the delta of a singleton doc ID with the previous doc ID (`termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);` in `Lucene90PostingsReader`). I wonder that it doesn't only help with storage, but also with lookups since it's stored as a single-byte vInt in the incremental case, but often on a variable number of bytes after reordering.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org