You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "jpountz (via GitHub)" <gi...@apache.org> on 2023/10/12 19:09:15 UTC

[I] Enable recursive graph bisection out of the box? [lucene]

jpountz opened a new issue, #12665:
URL: https://github.com/apache/lucene/issues/12665

   ### Description
   
   It would be nice to enable recursive graph bisection out of the box, so that users don't even have to know that it exists or what it is to enjoy its search-time performance benefits.
   
   
   - [x] #12489
   - [x] #12622
   - [ ] Record segments that contains blocks of documents (`IndexWriter#addDocuments` with an `s`)
   - [ ] Figure out lightweight parameters of recursive graph bisection that would make the merging overhead reasonable
   - [ ] Enable recursive graph bisection automatically when no index sort is configured and only `addDocument` is used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771393466

   I was just checking out a profile, and with this lightweight BP configuration, we end up spending more time on building the forward index (essentially calling `OfflineSorter` on all postings) and merging (because `SortingCodecReader` introduces overhead as it needs to re-sort e.g. postings on the fly) than on reordering.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1774523421

   These sound like great ideas!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "gf2121 (via GitHub)" <gi...@apache.org>.

gf2121 commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1774050262

   > essentially calling OfflineSorter on all postings
   
   FYI, I came up with some ideas to optimize this sort before, hoping to be helpful :)
   
   1. If we use a stable sorter, we can only compare docIds because termIds are already in order.
   2. If we take the maxDoc into consideration, we can save 1 round of reorder when `maxDoc < (1 << 24)`.
   3. We may even purely use an offline version of radix sorter to sort the whole file, since all we need is just 3 or 4 times reorder based on point 1 and 2.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771129756

   This would be awesome to enable by default.  It would somehow disable itself if the application sets its own static index sort?
   
   It's odd/curious that `PKLookup` got slower -- maybe the loss of locality/predictability of the primary key in `luceneutil` makes BlockTree work harder for lookups.  Given the sizable gains on the other queries, we can accept some loss of indexing performance in exchange for these search-time gains, by default?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "gf2121 (via GitHub)" <gi...@apache.org>.

gf2121 commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1775324737

   I initialized a PR on these ideas https://github.com/apache/lucene/pull/12712


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1770827026

   I did a first indexing run on wikibigall with the following merge policy, which I tried to make as lightweight as possible:
   
   ```
       BPIndexReorderer reorderer = new BPIndexReorderer();
       reorderer.setMinDocFreq(16384);
       reorderer.setMaxIters(3);
       reorderer.setMinPartitionSize(8192);
       mp = new BPReorderingMergePolicy(mp, reorderer, 131072);
   ```
   
   Indexing ran in 3402170 msec vs. 2610068 msec without reordering, ie. 30% slower. (This is when running with default params, ie. maxBufferedDocs=12119, SerialMergeScheduler, LogDocMergePolicy (wrapped within BPReorderingMergePolicy), etc.) Search was noticeably faster:
   
   ```
                              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                           PKLookup      279.70      (1.6%)      247.15      (2.8%)  -11.6% ( -15% -   -7%) 0.000
                           HighTerm      441.90      (7.8%)      414.88      (4.4%)   -6.1% ( -17% -    6%) 0.002
                     CountOrHighMed       94.53     (16.3%)       92.77     (15.9%)   -1.9% ( -29% -   36%) 0.715
                    CountOrHighHigh       60.96     (16.4%)       60.03     (16.4%)   -1.5% ( -29% -   37%) 0.768
                  HighTermMonthSort     4314.65      (2.3%)     4274.31      (2.8%)   -0.9% (  -5% -    4%) 0.248
                            Respell       64.51      (1.1%)       64.38      (1.6%)   -0.2% (  -2% -    2%) 0.654
                        CountPhrase        3.54     (11.2%)        3.59      (8.2%)    1.2% ( -16% -   23%) 0.700
                           Wildcard       71.79      (2.6%)       72.66      (2.7%)    1.2% (  -3% -    6%) 0.143
                             Fuzzy2       89.57      (0.9%)       90.80      (1.2%)    1.4% (   0% -    3%) 0.000
                            Prefix3      123.64      (3.4%)      125.34      (2.8%)    1.4% (  -4% -    7%) 0.159
                          CountTerm    14193.65      (3.6%)    14589.84      (2.9%)    2.8% (  -3% -    9%) 0.007
                             IntNRQ      289.67      (6.0%)      299.22      (5.6%)    3.3% (  -7% -   15%) 0.074
                         HighPhrase        5.95      (7.6%)        6.16      (9.3%)    3.6% ( -12% -   22%) 0.180
                             Fuzzy1      104.33      (0.9%)      108.08      (1.2%)    3.6% (   1% -    5%) 0.000
                          LowPhrase       17.70      (3.3%)       18.74      (4.9%)    5.9% (  -2% -   14%) 0.000
                            MedTerm      533.08      (7.8%)      568.40      (4.5%)    6.6% (  -5% -   20%) 0.001
                         OrHighHigh       56.43      (5.8%)       60.45      (7.0%)    7.1% (  -5% -   21%) 0.000
                    CountAndHighMed      124.71      (3.2%)      136.61      (4.5%)    9.5% (   1% -   17%) 0.000
                          OrHighMed      212.88      (4.0%)      233.35      (5.0%)    9.6% (   0% -   19%) 0.000
                          OrHighLow      604.12      (2.8%)      676.18      (4.5%)   11.9% (   4% -   19%) 0.000
                         AndHighLow      933.07      (2.3%)     1046.85      (2.7%)   12.2% (   7% -   17%) 0.000
                            LowTerm      947.45      (6.1%)     1091.11      (4.7%)   15.2% (   4% -   27%) 0.000
                         AndHighMed      197.62      (3.1%)      232.11      (3.1%)   17.5% (  10% -   24%) 0.000
                          MedPhrase       42.93      (2.7%)       50.47      (5.4%)   17.6% (   9% -   26%) 0.000
                        AndHighHigh       52.74      (4.0%)       63.41      (4.6%)   20.2% (  11% -   30%) 0.000
                   CountAndHighHigh       41.69      (3.5%)       50.60      (5.6%)   21.4% (  11% -   31%) 0.000
              HighTermDayOfYearSort      444.76      (1.7%)      652.11      (2.0%)   46.6% (  42% -   51%) 0.000
   ```
   
   I'll look into whether I can reduce the merge-time overhead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Enable recursive graph bisection out of the box? [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1771171862

   >  It would somehow disable itself if the application sets its own static index sort?
   
   This is correct. This bit already works on the PR, IndexWriter doesn't check the new method that `OneMerge` got that allows reordering the resulting segment if an index sort is configured in the `IndexWriterConfig`.
   
   > It's odd/curious that PKLookup got slower
   
   The terms dictionary should be exactly the same, only doc IDs get renumbered. BlockTree optimizes for incremental IDs by storing the delta of a singleton doc ID with the previous doc ID (`termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);` in `Lucene90PostingsReader`). I wonder that it doesn't only help with storage, but also with lookups since it's stored as a single-byte vInt in the incremental case, but often on a variable number of bytes after reordering.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org