You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "slow-J (via GitHub)" <gi...@apache.org> on 2023/10/18 17:01:44 UTC

[I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

slow-J opened a new issue, #12696:
URL: https://github.com/apache/lucene/issues/12696

   ### Description
   
   Background: In https://github.com/Tony-X/search-benchmark-game we were comparing performance of Tantivy and Lucene. "One difference between Lucene and Tantivy is Lucene uses the "patch" FOR, meaning the large values in a block are held out as exceptions so that the remaining values can use a smaller number of bits to encode, a tradeoff of CPU for lower storage space." In https://github.com/Tony-X/search-benchmark-game/issues/46 , I disable the patching in Lucene, to match how Tantivy encodes and run the search-benchmark-game to test the change.
   
   Lucene modifications for testing: I cloned the pforUtil and removed all logic related to patching the exceptions. I modified the Lucene90PostingsReader + Writer to use the util with no patching logic, see sample code https://github.com/slow-J/lucene/commit/83ec5a8b9f7ed39b8aa3ee948ffe5288a9d3fb16
   Hardware used: EC2 Graviton3 instance, m6g.4xlarge
   
   Results from the search-benchmark-game: https://github.com/Tony-X/search-benchmark-game/issues/46#issuecomment-1693714327
   We saw Lucene's latency improve: -2% in COUNT, -2% in TOP_10_COUNT, -2.07% in TOP_100.
   
   I then ran a Lucene benchmark with [luceneutil](https://github.com/mikemccand/luceneutil) `python3 src/python/localrun.py -source wikimediumall -r`
   Hardware used: EC2 Graviton3 instance, m6g.4xlarge
   
   Posting results below
   ```
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               BrowseDateSSDVFacets        0.95      (4.5%)        0.95      (5.9%)   -1.0% ( -10% -    9%) 0.566
                  HighTermMonthSort     2558.85      (2.6%)     2539.66      (4.6%)   -0.7% (  -7% -    6%) 0.526
        BrowseRandomLabelSSDVFacets        2.48      (4.3%)        2.47      (1.4%)   -0.7% (  -6% -    5%) 0.486
                            Prefix3      139.77      (1.7%)      139.41      (3.1%)   -0.3% (  -4% -    4%) 0.735
        BrowseRandomLabelTaxoFacets        3.38      (1.5%)        3.37      (1.8%)   -0.2% (  -3% -    3%) 0.689
                             Fuzzy2       50.20      (2.0%)       50.11      (2.1%)   -0.2% (  -4% -    3%) 0.776
                            MedTerm      439.13      (3.9%)      438.94      (4.3%)   -0.0% (  -7% -    8%) 0.974
              BrowseMonthSSDVFacets        3.48      (1.8%)        3.48      (1.5%)    0.0% (  -3% -    3%) 0.949
                            Respell       37.97      (2.9%)       37.99      (2.7%)    0.1% (  -5% -    5%) 0.945
                            LowTerm      323.73      (4.2%)      323.93      (3.9%)    0.1% (  -7% -    8%) 0.960
                             Fuzzy1       45.01      (2.5%)       45.06      (2.6%)    0.1% (  -4% -    5%) 0.896
                           PKLookup      154.24      (2.6%)      154.41      (2.1%)    0.1% (  -4% -    4%) 0.886
                           Wildcard       71.96      (1.4%)       72.04      (1.9%)    0.1% (  -3% -    3%) 0.823
          BrowseDayOfYearSSDVFacets        3.32      (1.8%)        3.33      (1.9%)    0.2% (  -3% -    4%) 0.789
                             IntNRQ       27.77      (8.5%)       27.84      (9.6%)    0.3% ( -16% -   19%) 0.929
                       OrHighNotLow      183.77      (6.9%)      184.66      (5.4%)    0.5% ( -11% -   13%) 0.805
               HighTermTitleBDVSort        4.81      (3.0%)        4.83      (3.0%)    0.5% (  -5% -    6%) 0.566
                           HighTerm      496.37      (5.5%)      499.12      (5.2%)    0.6% (  -9% -   11%) 0.745
                         TermDTSort      109.66      (1.4%)      110.31      (1.3%)    0.6% (  -1% -    3%) 0.155
                   HighSloppyPhrase       23.39      (4.2%)       23.56      (3.8%)    0.7% (  -7% -    9%) 0.580
              BrowseMonthTaxoFacets        3.88      (2.5%)        3.91      (1.7%)    0.8% (  -3% -    5%) 0.261
          BrowseDayOfYearTaxoFacets        3.89      (2.5%)        3.92      (0.9%)    0.8% (  -2% -    4%) 0.171
               BrowseDateTaxoFacets        3.88      (2.4%)        3.91      (0.8%)    0.9% (  -2% -    4%) 0.117
                      OrHighNotHigh      270.96      (5.3%)      273.88      (4.2%)    1.1% (  -7% -   11%) 0.474
                       OrHighNotMed      202.84      (6.3%)      205.09      (4.7%)    1.1% (  -9% -   12%) 0.531
               HighIntervalsOrdered        4.36      (4.2%)        4.41      (4.3%)    1.1% (  -7% -   10%) 0.399
                  HighTermTitleSort       20.12      (5.5%)       20.35      (5.6%)    1.2% (  -9% -   12%) 0.511
               MedTermDayTaxoFacets        9.42      (3.2%)        9.53      (4.3%)    1.2% (  -6% -    8%) 0.325
                         OrHighHigh       12.07      (4.9%)       12.22      (4.9%)    1.3% (  -8% -   11%) 0.415
                       HighSpanNear        9.92      (3.5%)       10.06      (3.3%)    1.3% (  -5% -    8%) 0.212
                      OrNotHighHigh      356.64      (4.5%)      361.84      (3.3%)    1.5% (  -6% -    9%) 0.242
                MedIntervalsOrdered        3.09      (2.9%)        3.14      (3.0%)    1.5% (  -4% -    7%) 0.119
             OrHighMedDayTaxoFacets        1.54      (4.6%)        1.56      (5.3%)    1.5% (  -8% -   11%) 0.338
                       OrNotHighMed      346.18      (3.3%)      351.65      (3.6%)    1.6% (  -5% -    8%) 0.144
                    LowSloppyPhrase       21.58      (2.6%)       21.94      (2.0%)    1.7% (  -2% -    6%) 0.025
                          OrHighMed       48.51      (3.4%)       49.50      (3.5%)    2.0% (  -4% -    9%) 0.062
                        MedSpanNear       26.76      (1.4%)       27.35      (2.2%)    2.2% (  -1% -    5%) 0.000
              HighTermDayOfYearSort      223.99      (2.7%)      228.97      (2.1%)    2.2% (  -2% -    7%) 0.004
                          OrHighLow      208.96      (3.2%)      213.65      (3.0%)    2.2% (  -3% -    8%) 0.022
                        AndHighHigh       13.93      (3.6%)       14.25      (3.7%)    2.4% (  -4% -   10%) 0.042
                    MedSloppyPhrase       19.45      (2.5%)       19.92      (2.0%)    2.4% (  -1% -    7%) 0.001
           AndHighHighDayTaxoFacets        2.12      (3.9%)        2.17      (5.2%)    2.4% (  -6% -   11%) 0.094
                        LowSpanNear        8.99      (1.6%)        9.24      (1.4%)    2.7% (   0% -    5%) 0.000
                LowIntervalsOrdered        7.46      (2.6%)        7.67      (3.1%)    2.8% (  -2% -    8%) 0.002
                         HighPhrase       12.54      (2.8%)       12.91      (3.3%)    2.9% (  -3% -    9%) 0.003
                          MedPhrase       25.45      (1.8%)       26.22      (2.1%)    3.0% (   0% -    6%) 0.000
            AndHighMedDayTaxoFacets       16.49      (1.7%)       17.08      (2.3%)    3.6% (   0% -    7%) 0.000
                         AndHighMed       91.80      (2.2%)       95.26      (2.3%)    3.8% (   0% -    8%) 0.000
                          LowPhrase      144.04      (1.5%)      150.03      (2.0%)    4.2% (   0% -    7%) 0.000
                       OrNotHighLow      315.83      (1.8%)      330.04      (2.6%)    4.5% (   0% -    9%) 0.000
                         AndHighLow      352.57      (2.7%)      379.44      (3.7%)    7.6% (   1% -   14%) 0.000
   ```
   The tasks at the bottom of the table had the larger QPS improvement.
   AndHighLow has a QPS improvement of +7.6%!
   
   Size of test candidate index:   17.626 GiB	total
   Size of test baseline index:   18.401 GiB	total
   
   This change would bring a 4.39691% increase in index size.
   
   ### Proposal
   I propose adding an option to Lucene's codec that allows users to disable patching in the PFOR encoding, providing the option to leverage the performance benefits observed here at a cost of index size.
   I would appreciate all feedback and further evaluation of this idea by the Lucene community.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.

Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775807115

   > In 11.0, remove all patching logic which will, a) simplify the code a bit, and b) remove the (likely minor) overhead on read of looking up the number of patches in a block, which is always 0.
   
   I like this think of it as an index-level property rather than a block-level property. We can just check once instead of doing it per-block. 
   
   Maybe write something in the index header to indicate if patching is there (default to yes - in 9.x ). Then new indexes will write additional header to indicate there is not patching in the whole index (10.x behavior). Then remove the header as well as the patching support altogether in 11.x.     


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543

For reference, Lucene used to use FOR for postings and PFOR for positions in 8.x. This was changed in 9.0 via #69 to use PFOR for both postings and positions. This PR says it made the index 3% smaller with no performance impact, but I can believe that we are noticing an impact now as many things changed in the meantime. I'm +1 to switching back to FOR if it yields better performance.

I have a preference for keeping PFOR for positions and only moving postings to FOR (essentially reverting #69). The benchmark in this issue description used wikimedium, which by design doesn't have much position data since all documents are truncated. Using PFOR for positions and FOR for postings sounds like a good trade-off to me as positions are less important for performance typically. And if someone wants better performance for their phrase queries, it would likely be a better idea to use a `CommonGramsFilter` than to switch positions from PFOR to FOR?

I remember we observed a [15% reduction](https://twitter.com/jpountz/status/1486608300905009156) of our inverted indexes for logs when Lucene moved from FOR to PFOR at Elastic, but I don't think it should block this change, Elasticsearch can maintain its own postings format that uses PFOR for postings. I'm just mentioning it as a way to highlight that I'm expecting that some users will observe an increased disk usage that is more than 3%.

Regarding backward compatibility, let's do it with codecs as usual: fork `Lucene90PostingsFormat` into a new `Lucene99PostingsFormat` that uses PFOR for postings. Then the codec infrastructure will make sure to keep using the old postings format for existing segments and the new postings format for new segments (including merged ones).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "slow-J (via GitHub)" <gi...@apache.org>.

slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1787521225

   Thanks all for the feedback. Will proceed with removing patching only for doc blocks (reverting some of https://github.com/apache/lucene/pull/69)
   
   All the changes needed to create a `Lucene99PostingsFormat` has made the PR quite large, so I am starting it as a draft PR, https://github.com/apache/lucene/pull/12741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.

Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775659415

   > It would be fully back-compat whether we release in 9.9 vs 10.0.
   
   Hmm, can you elaborate how it can be fully backwards-compatible on with the indexes that have patching? 
   
   Is the assumption that we will introduce an option to disable patching? I thought we are thinking to remove it entirely. Maybe I missed something...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "slow-J (via GitHub)" <gi...@apache.org>.

slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1790638953

   > Another exciting optimization such a "patch-less" encoding could implement is within-block skipping (I believe Tantivy does this).
   > 
   > Today, our skipper is forced to align to block boundaries, so when we skip to a given `docid`, we go to the block that may contain this `docid`, decode all 128 `int[]`, then linearly scan within those 128 ints. This is quite a bit of overhead for each skip request!
   > 
   > If we could lower that linear scan cost to maybe 16 or 8 or something, the conjunctive queries should get even faster. But perhaps it becomes trickier to take advantage of SIMD optimizations if we are decoding a subset of ints, not sure.
   
   I have created https://github.com/apache/lucene/issues/12749 to explore this idea!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1786969801

   > Normally the IntNRQ (1D points numeric range query) is very noisy, but maybe this gain is real? p-value seems to think it could be close to real?
   
   I'm not sure how it could not be noise, since it never needs to decode postings?
   
   Maybe we should also look into the trade-off of keeping freqs on PFOR vs. switching them to FOR, given that:
    - Non-scoring queries never need freqs, e.g. the counting queries would still see the same speedups.
    - Conjunctions only need to decode freq blocks when all clauses agree on a doc.
    - Both disjunctions and conjunctions can skip decoding freq blocks if max scores are not high enough for a hit to be competitive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "slow-J (via GitHub)" <gi...@apache.org>.

slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775353027

   If we want to remove the patching entirely, which Lucene version (and which Codec) should we implement this in? Would this be a potential change for Lucene 9.9 or perhaps 10.0?
   
   Are there any additional corpora that we should also test this with?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775871779

   > Maybe write something in the index header to indicate if patching is there (default to yes - in 9.x ). Then new indexes will write additional header to indicate there is not patching in the whole index (10.x behavior). Then remove the header as well as the patching support altogether in 11.x.
   
   @Tony-X would the goal here be to eliminate overhead of having to read the number of patches when decoding each block? Or is there more to it than that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.

rmuir commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1773935712

   Should we just do more tests and start writing indexes without patching? Only a 4 percent disk savings? It is a lot of complexity, especially to vectorize. A runtime option is more expensive because then we have to make sure indexes encoded both ways can be read, it only adds more complexity imo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1774236604

   > Should we just do more tests and start writing indexes without patching? Only a 4 percent disk savings? It is a lot of complexity, especially to vectorize. A runtime option is more expensive because then we have to make sure indexes encoded both ways can be read, it only adds more complexity imo
   
   +1 to remove patching entirely!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "msokolov (via GitHub)" <gi...@apache.org>.

msokolov commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775716306

   > Hmm, can you elaborate how it can be fully backwards-compatible on with the indexes that have patching?
   
   I think the idea is that because we always maintain readers that can read prior index versions, we will be able to read the old patched indexes, but we would only write new ones lacking patching (which might not be readable by old readers, i.e. forwards-compatible, if we changed the metadata).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775717147

I like the idea of removing the complexity associated with patching if we're convinced it's the right trade-off (and +1 to the pain of vectorizing with patching going away).

Also +1 to releasing with 9.9 and not waiting. To address the back-compat questions a little bit, along with some of Mike's earlier questions, the number of patches is written into the same byte used to encode the number of bits-per-value used for the FOR block. We only need 5 bits to encode the b.p.v., and we reserve the remaining 3 to encode the number of patches (which is why we have an upper bound of 7 patches currently). So we can always "write" 0 patches in this byte and remain fully backwards compatible on read, which is great (but it also means we can't claw back some saving by getting rid of the space in the index needed to encode the patches... to answer Mike's earlier question a bit).

So I think the roll out might look something like this:
1. In 9.x, always write zero patches when creating the index. Leave the code in place to handle patches if present.
2. In 10.x, continue to support patches on read (as in 9.x) so 10.x remains compatible with 9.x indices.
3. In 11.0, remove all patching logic which will, a) simplify the code a bit, and b) remove the (likely minor) overhead on read of looking up the number of patches in a block, which is always 0.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775725064

   > +1. I recalled that @gsmiller was playing with some SIMD algos for decoding blocks of delta-encoded ints. Even if that is fruitful it'd be tricky to apply it because of the patching.
   
   Yes, that's right. There was some experimentation with [this algorithm](https://en.algorithmica.org/hpc/algorithms/prefix/) for vectorized prefix sum.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.

Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775993940

   > would the goal here be to eliminate overhead of having to read the number of patches when decoding each block?
   
   Yes. This means we could know upfront at segment opening time which code to use (FOR or PatchedFOR), as opposed to use PatchedFOR code to deal with blocks that don't have patches. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770461719

   Another exciting optimization such a "patch-less" encoding could implement is within-block skipping (I believe Tantivy does this).
   
   Today, our skipper is forced to align to block boundaries, so when we skip to a given `docid`, we go to the block that may contain this `docid`, decode all 128 `int[]`, then linearly scan within those 128 ints.  This is quite a bit of overhead for each skip request!
   
   If we could lower that linear scan cost to maybe 16 or 8 or something, the conjunctive queries should get even faster.  But perhaps it becomes trickier to take advantage of SIMD optimizations if we are decoding a subset of ints, not sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1782814872

   FWIW I could reproduce the speedup from disabling patching locally on wikibigall:
   
   ```
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                   HighSloppyPhrase        1.58      (4.9%)        1.55      (9.5%)   -1.9% ( -15% -   13%) 0.441
                          CountTerm    13139.34      (3.0%)    12937.65      (3.1%)   -1.5% (  -7% -    4%) 0.111
                  HighTermMonthSort     4393.07      (1.5%)     4371.13      (2.1%)   -0.5% (  -4% -    3%) 0.396
                            Prefix3      270.78      (3.4%)      271.17      (3.2%)    0.1% (  -6% -    7%) 0.889
                             Fuzzy1      100.16      (1.1%)      100.42      (0.9%)    0.3% (  -1% -    2%) 0.420
                             Fuzzy2       70.62      (1.1%)       70.84      (1.0%)    0.3% (  -1% -    2%) 0.340
                         HighPhrase       16.41      (3.8%)       16.49      (5.3%)    0.5% (  -8% -    9%) 0.738
                            Respell       53.35      (1.8%)       53.62      (1.6%)    0.5% (  -2% -    4%) 0.351
                           HighTerm      418.58      (9.3%)      421.91      (7.7%)    0.8% ( -14% -   19%) 0.770
                    LowSloppyPhrase        9.87      (2.5%)        9.96      (5.9%)    0.9% (  -7% -    9%) 0.544
                           Wildcard       94.17      (2.8%)       95.04      (3.3%)    0.9% (  -5% -    7%) 0.341
                            MedTerm      553.18      (8.3%)      559.16      (6.9%)    1.1% ( -13% -   17%) 0.656
                            LowTerm      784.38      (7.0%)      793.23      (5.7%)    1.1% ( -10% -   14%) 0.575
                           PKLookup      264.32      (2.8%)      267.44      (2.2%)    1.2% (  -3% -    6%) 0.138
                       HighSpanNear        5.37      (3.2%)        5.44      (3.3%)    1.3% (  -5% -    8%) 0.213
                          OrHighLow      590.78      (3.0%)      598.60      (2.6%)    1.3% (  -4% -    7%) 0.132
                          LowPhrase       27.97      (3.7%)       28.42      (4.9%)    1.6% (  -6% -   10%) 0.245
                        LowSpanNear       11.09      (2.1%)       11.30      (2.2%)    1.8% (  -2% -    6%) 0.007
                        MedSpanNear        7.39      (3.3%)        7.53      (3.5%)    1.9% (  -4% -    8%) 0.079
                         AndHighLow      819.00      (2.9%)      835.98      (2.4%)    2.1% (  -3% -    7%) 0.015
                          MedPhrase       91.27      (3.4%)       93.33      (4.6%)    2.3% (  -5% -   10%) 0.078
              HighTermDayOfYearSort      447.47      (1.8%)      457.76      (1.8%)    2.3% (  -1% -    5%) 0.000
                    MedSloppyPhrase       16.79      (2.4%)       17.25      (4.6%)    2.7% (  -4% -    9%) 0.017
                          OrHighMed      157.99      (2.2%)      162.38      (2.3%)    2.8% (  -1% -    7%) 0.000
                         OrHighHigh       67.77      (1.7%)       69.71      (1.9%)    2.9% (   0% -    6%) 0.000
                        AndHighHigh       48.92      (1.8%)       50.67      (2.1%)    3.6% (   0% -    7%) 0.000
                         AndHighMed      174.71      (2.3%)      181.03      (2.6%)    3.6% (  -1% -    8%) 0.000
                   CountAndHighHigh       38.11      (4.4%)       39.65      (5.1%)    4.1% (  -5% -   14%) 0.007
                    CountOrHighHigh       56.07     (16.2%)       58.48     (18.5%)    4.3% ( -26% -   46%) 0.435
                     CountOrHighMed       86.95     (16.0%)       90.94     (18.3%)    4.6% ( -25% -   46%) 0.398
                    CountAndHighMed      116.35      (3.4%)      121.78      (4.8%)    4.7% (  -3% -   13%) 0.000
                        CountPhrase        3.26     (11.2%)        3.43     (13.1%)    5.1% ( -17% -   33%) 0.187
                             IntNRQ      146.09     (35.1%)      171.55     (36.7%)   17.4% ( -40% -  137%) 0.124
   ```
   
   `.doc` files were 12% larger overall (2.64GB to 2.96GB), `.pos` files were 11% larger (11.03GB to 12.24GB), and the index was 9.7% larger (15.66GB to 17.18GB).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1786949852

Thanks for testing @jpountz.

I think at some point we also enabled patching for the freq blocks inside `.doc` file too?

Normally the `IntNRQ` (1D points numeric range query) is very noisy, but maybe this gain is real? p-value seems to think it could be close to real?

The conjunctive and disjunctive gains are awesome.

> Regarding backward compatibility, let's do it with codecs as usual: fork `Lucene90PostingsFormat` into a new `Lucene99PostingsFormat` that uses PFOR for postings. Then the codec infrastructure will make sure to keep using the old postings format for existing segments and the new postings format for new segments (including merged ones).

> I have a preference for keeping PFOR for positions and only moving postings to FOR (essentially reverting https://github.com/apache/lucene/pull/69).

This change might make SIMD decoding more palatable for the unpached `int[]` blocks (doc, freq).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1769095158

   These results are really interesting! As another option, I wonder if it's worth thinking about this problem as a new codec (sandbox module to start?) that biases towards query speed instead of index size? There may be other decisions we would make differently in a codec biasing in that direction beyond FOR patching. I dunno... maybe that's a terrible idea with a "slippery slope" problem. But I also worry a bit about adding configuration switches to a default codec that need to be maintained. One-size-fits-all solutions are indeed challenging... 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "slow-J (via GitHub)" <gi...@apache.org>.

slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1771275256

   >Did you turn off patching for all encoded int[] blocks (docs, freqs, positions)?
   
   Yes, I think so. All uses of `pforUtil` in the postingsReader and writer were replaced with the no patching util.
   
   > I'm curious: did you just force no patching at write time, but still write a header into each block saying "there are 0 patches"? If so, we could save a bit of space by removing that header entirely since it'll always be 0), and perhaps gain a bit of performance by not having to check that header at read time.
   
   I essentially set the number of exceptions to 0 and removed related logic, still keeping the header with 0 patches. I'll iterate and remove this to see the difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770452771

   That's a neat idea (separate codec that trades off index size for faster search performance).  Maybe it could also fold in the [fully in RAM FST term dictionary](https://github.com/apache/lucene/pull/12688) that @Tony-X is working on, if that is a nice speedup.
   
   But some of our Codec formats, e.g. for stored fields, have two options to make the tradeoff an explicit choice by the user (`BEST_COMPRESSION` vs `BEST_SPEED`).  Maybe, if this new sandbox Codec works out, some of its tradeoffs could be folded into the default Codec with similar constants to make the tradeoff explicit.
   
   Also, "no patching" is something we already support at read-time since some blocks today will legitimately have no patching.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770457453

   > Posting results below
   
   The results are impressive!  Conjunctive (-like) queries see sizable gains.
   
   Did you turn off patching for all encoded `int[]` blocks (docs, freqs, positions)?
   
   > This change would bring a 4.39691% increase in index size.
   
   I'm curious: did you just force no patching at write time, but still write a header into each block saying "there are 0 patches"?  If so, we could save a bit of space by removing that header entirely since it'll always be 0), and perhaps gain a bit of performance by not having to check that header at read time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1786989074

   This change would also make it easier to [skip within blocks](https://github.com/apache/lucene/issues/12486), which might be a sizable win for conjunctions and NOT clauses.
   
   > I'm not sure how it could not be noise, since it never needs to decode postings?
   
   Oh, you are right!  It's entirely handled by the BKD tree.  Phew.  How noisy this query is indeed!  Hmm how do we encode the `int[]` blocks in BKD ...
   
   > Maybe we should also look into the trade-off of keeping freqs on PFOR vs. switching them to FOR, given that:
   
   +1 -- maybe we leave freqs with patching, and only remove patching for doc blocks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand closed issue #12696: Adding option to codec to disable patching in Lucene's PFOR encoding
URL: https://github.com/apache/lucene/issues/12696


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775638986

   > Are there any additional corpora that we should also test this with?
   
   Maybe the NYC taxis?  This is a more sparse, and tiny docs (vs dense and medium/large docs in `enwiki`).  The tooling for indexing the NYC taxis corpus is already in `luceneutil` (it runs nightly: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html).  This is a nice counter-point to `enwiki`.
   
   > Would this be a potential change for Lucene 9.9 or perhaps 10.0?
   
   That's a good question.  It is a very low level index format change, and no API change.  It would be fully back-compat whether we release in 9.9 vs 10.0.  I don't see why we should withhold the change until 10.0, so maybe 9.9?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.

Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775698779

   > It is a lot of complexity, especially to vectorize.
   
   +1. I recalled that @gsmiller was playing with some SIMD algos for decoding blocks of delta-encoded ints. Even if that is fruitful it'd be tricky to apply it because of the patching. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org