You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Tony-X (via GitHub)" <gi...@apache.org> on 2023/10/16 22:11:07 UTC

[PR] Random access term dictionary [lucene]

Tony-X opened a new pull request, #12688:
URL: https://github.com/apache/lucene/pull/12688

   ### Description
   Related issue https://github.com/apache/lucene/issues/12513
   
   Opening this PR early to avoid massive diffs in one-shot
   
   - [x]  Encode (term type, local ord) in FST
   
   TODO:
   - [ ] Implement bit-packing and unpacking for each term type
   - [ ] Implement the PostingsFormat
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773113866

   Thanks @bruno-roustant ! If you're okay to share it feel free to share it here. 
   
   I'm in the process of baking my own specific implementation (which internally uses a single long as bit buffer), but I might absorb some interesting ideas from your impl. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857417012

   Here is the even more interesting stuff. After all those allocation optimizations. I also implemented the on-paper more "efficient" algorithm to intersect FST and FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance with some skipping). FST in some cases have direct addressing which is exploited, too. As a side note -- it also uncovered a bug for the NFA impl which is tracked here https://github.com/apache/lucene/issues/12906.  
   
   But that change is not moving the needle at all for `WildCard` and `Prefix3` search tasks. 
   
   
   ```
   Before 
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                           Wildcard       62.85      (1.8%)       12.66      (0.6%)  -79.9% ( -80% -  -78%) 0.000
                             Fuzzy2       55.12      (1.0%)       18.92      (0.8%)  -65.7% ( -66% -  -64%) 0.000
                             Fuzzy1       61.20      (0.8%)       22.55      (0.8%)  -63.2% ( -64% -  -62%) 0.000
                            Respell       31.11      (0.8%)       11.95      (0.6%)  -61.6% ( -62% -  -60%) 0.000
                            Prefix3      135.69      (2.0%)       65.06      (0.7%)  -52.1% ( -53% -  -50%) 0.000
                  HighTermTitleSort      119.58      (0.9%)      111.03      (1.7%)   -7.2% (  -9% -   -4%) 0.000
                             IntNRQ       22.25      (1.1%)       21.87      (1.5%)   -1.7% (  -4% -    0%) 0.000
                         HighPhrase       25.82      (3.6%)       25.55      (3.2%)   -1.1% (  -7% -    5%) 0.318
                          MedPhrase        7.41      (2.4%)        7.35      (2.2%)   -0.8% (  -5% -    3%) 0.259
                        LowSpanNear        8.81      (1.9%)        8.74      (2.1%)   -0.8% (  -4% -    3%) 0.202
             OrHighMedDayTaxoFacets        3.86      (5.8%)        3.83      (4.8%)   -0.8% ( -10% -   10%) 0.636
                         TermDTSort      100.75      (2.9%)       99.98      (2.0%)   -0.8% (  -5% -    4%) 0.336
               HighIntervalsOrdered        6.07      (2.1%)        6.03      (2.4%)   -0.7% (  -5% -    3%) 0.342
                MedIntervalsOrdered       45.89      (2.0%)       45.61      (2.4%)   -0.6% (  -4% -    3%) 0.389
                       HighSpanNear       10.73      (1.0%)       10.66      (1.5%)   -0.6% (  -3% -    2%) 0.165
              HighTermDayOfYearSort      206.09      (1.8%)      204.93      (1.9%)   -0.6% (  -4% -    3%) 0.338
                LowIntervalsOrdered        8.39      (2.3%)        8.37      (2.5%)   -0.3% (  -5% -    4%) 0.654
                        MedSpanNear       66.00      (1.3%)       65.81      (1.9%)   -0.3% (  -3% -    2%) 0.574
                            MedTerm      322.61      (4.7%)      321.89      (4.5%)   -0.2% (  -9% -    9%) 0.878
            AndHighMedDayTaxoFacets       22.62      (1.0%)       22.58      (1.2%)   -0.2% (  -2% -    2%) 0.617
                          LowPhrase       48.52      (1.3%)       48.46      (1.4%)   -0.1% (  -2% -    2%) 0.745
                            LowTerm      403.54      (2.9%)      403.22      (2.4%)   -0.1% (  -5% -    5%) 0.923
        BrowseRandomLabelTaxoFacets        3.20      (0.7%)        3.20      (0.9%)   -0.0% (  -1% -    1%) 0.905
           AndHighHighDayTaxoFacets        8.06      (1.3%)        8.06      (1.6%)    0.0% (  -2% -    2%) 0.962
          BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.76      (0.6%)    0.0% (  -1% -    1%) 0.859
              BrowseMonthTaxoFacets        3.62      (1.0%)        3.62      (1.0%)    0.1% (  -1% -    2%) 0.866
                      OrHighNotHigh      156.16      (6.0%)      156.26      (5.9%)    0.1% ( -11% -   12%) 0.972
               BrowseDateTaxoFacets        3.73      (0.6%)        3.73      (0.6%)    0.1% (  -1% -    1%) 0.722
                      OrNotHighHigh      144.55      (5.1%)      144.68      (4.7%)    0.1% (  -9% -   10%) 0.957
               MedTermDayTaxoFacets       17.57      (2.8%)       17.59      (2.8%)    0.2% (  -5% -    5%) 0.863
          BrowseDayOfYearSSDVFacets        3.22      (0.9%)        3.23      (0.7%)    0.2% (  -1% -    1%) 0.424
                           HighTerm      401.13      (5.4%)      402.30      (5.7%)    0.3% ( -10% -   12%) 0.868
              BrowseMonthSSDVFacets        3.27      (0.7%)        3.28      (1.0%)    0.3% (  -1% -    2%) 0.202
                        AndHighHigh       20.87      (2.8%)       20.94      (2.5%)    0.4% (  -4% -    5%) 0.670
                   HighSloppyPhrase        6.58      (3.3%)        6.61      (3.4%)    0.4% (  -6% -    7%) 0.727
                         AndHighMed       89.34      (1.8%)       89.76      (1.4%)    0.5% (  -2% -    3%) 0.355
               BrowseDateSSDVFacets        0.95      (2.8%)        0.95      (4.0%)    0.5% (  -6% -    7%) 0.656
                       OrNotHighLow      420.17      (2.2%)      422.34      (2.2%)    0.5% (  -3% -    4%) 0.452
                    LowSloppyPhrase        2.89      (2.4%)        2.91      (1.9%)    0.6% (  -3% -    4%) 0.369
                       OrNotHighMed      219.50      (4.5%)      221.02      (4.1%)    0.7% (  -7% -    9%) 0.611
                    MedSloppyPhrase       10.44      (2.2%)       10.52      (1.4%)    0.7% (  -2% -    4%) 0.222
                       OrHighNotLow      288.48      (5.4%)      290.70      (5.7%)    0.8% (  -9% -   12%) 0.663
                          OrHighMed       53.25      (3.6%)       53.66      (3.5%)    0.8% (  -6% -    8%) 0.488
               HighTermTitleBDVSort        2.77      (7.2%)        2.79      (7.0%)    0.9% ( -12% -   16%) 0.699
                       OrHighNotMed      270.38      (5.7%)      272.88      (5.4%)    0.9% (  -9% -   12%) 0.601
                         OrHighHigh       20.86      (5.1%)       21.08      (5.2%)    1.1% (  -8% -   12%) 0.519
                          OrHighLow      220.40      (4.2%)      223.52      (5.6%)    1.4% (  -8% -   11%) 0.367
        BrowseRandomLabelSSDVFacets        2.32      (4.3%)        2.38      (7.6%)    2.3% (  -9% -   14%) 0.240
                         AndHighLow      395.82      (2.9%)      405.36      (2.8%)    2.4% (  -3% -    8%) 0.008
                  HighTermMonthSort     2375.71      (3.7%)     2555.72      (5.0%)    7.6% (  -1% -   16%) 0.000
                           PKLookup      140.60      (1.8%)      178.02      (1.3%)   26.6% (  23% -   30%) 0.000
   
   
   After
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                           Wildcard       37.77      (2.7%)        5.70      (1.3%)  -84.9% ( -86% -  -83%) 0.000
                            Prefix3       52.27      (2.7%)       22.71      (1.8%)  -56.6% ( -59% -  -53%) 0.000
                             Fuzzy1       59.96      (1.6%)       54.80      (2.3%)   -8.6% ( -12% -   -4%) 0.000
                  HighTermTitleSort      106.17      (2.1%)      101.30      (1.5%)   -4.6% (  -7% -   -1%) 0.000
                             Fuzzy2       33.40      (1.3%)       32.14      (1.6%)   -3.8% (  -6% -    0%) 0.000
                            MedTerm      273.84      (5.4%)      265.92      (9.7%)   -2.9% ( -17% -   12%) 0.245
                           HighTerm      349.66      (5.2%)      341.63      (8.9%)   -2.3% ( -15% -   12%) 0.320
                            LowTerm      356.23      (3.1%)      350.27      (4.3%)   -1.7% (  -8% -    5%) 0.156
                    LowSloppyPhrase        4.47      (2.1%)        4.44      (4.6%)   -0.8% (  -7% -    6%) 0.492
                       HighSpanNear        8.12      (2.1%)        8.06      (2.5%)   -0.7% (  -5% -    4%) 0.331
                    MedSloppyPhrase       31.05      (3.6%)       30.83      (4.1%)   -0.7% (  -8% -    7%) 0.559
                MedIntervalsOrdered        3.88      (3.3%)        3.86      (3.3%)   -0.7% (  -7% -    6%) 0.519
                        MedSpanNear        8.94      (1.2%)        8.88      (1.6%)   -0.7% (  -3% -    2%) 0.126
                LowIntervalsOrdered        7.40      (3.3%)        7.35      (3.4%)   -0.7% (  -7% -    6%) 0.537
                        LowSpanNear       29.33      (2.0%)       29.15      (2.2%)   -0.6% (  -4% -    3%) 0.374
                   HighSloppyPhrase        6.68      (3.6%)        6.64      (3.5%)   -0.5% (  -7% -    6%) 0.624
               MedTermDayTaxoFacets        9.14      (3.0%)        9.11      (8.2%)   -0.3% ( -11% -   11%) 0.861
                         HighPhrase      115.62      (3.9%)      115.24      (4.0%)   -0.3% (  -7% -    7%) 0.798                                                [162/1927]
                        AndHighHigh       13.95      (4.2%)       13.92      (4.5%)   -0.3% (  -8% -    8%) 0.847
              BrowseMonthSSDVFacets        3.30      (0.8%)        3.29      (1.0%)   -0.3% (  -2% -    1%) 0.377
                         AndHighMed       85.40      (2.0%)       85.18      (2.1%)   -0.2% (  -4% -    3%) 0.695
                             IntNRQ       16.65      (4.1%)       16.63      (3.8%)   -0.1% (  -7% -    8%) 0.914
        BrowseRandomLabelSSDVFacets        2.30      (0.9%)        2.30      (1.1%)   -0.1% (  -2% -    1%) 0.754
                         OrHighHigh       24.99      (6.1%)       24.97      (5.3%)   -0.1% ( -10% -   12%) 0.957
           AndHighHighDayTaxoFacets        2.29      (2.8%)        2.29      (2.4%)   -0.0% (  -5% -    5%) 0.977
            AndHighMedDayTaxoFacets       40.17      (1.4%)       40.20      (1.4%)    0.1% (  -2% -    2%) 0.872
             OrHighMedDayTaxoFacets        3.15      (3.9%)        3.15      (3.4%)    0.1% (  -6% -    7%) 0.946
                          LowPhrase       30.23      (2.5%)       30.26      (2.4%)    0.1% (  -4% -    5%) 0.911
                       OrNotHighLow      201.78      (2.9%)      202.03      (3.0%)    0.1% (  -5% -    6%) 0.896
        BrowseRandomLabelTaxoFacets        3.20      (3.2%)        3.21      (4.1%)    0.1% (  -6% -    7%) 0.899
               HighIntervalsOrdered        0.42      (1.9%)        0.42      (1.6%)    0.2% (  -3% -    3%) 0.699
                       OrNotHighMed      235.49      (5.5%)      236.01      (4.7%)    0.2% (  -9% -   11%) 0.892
              BrowseMonthTaxoFacets        3.62      (1.0%)        3.63      (1.0%)    0.2% (  -1% -    2%) 0.477
                      OrNotHighHigh      329.77      (4.9%)      330.79      (5.5%)    0.3% (  -9% -   11%) 0.851
                          MedPhrase       35.79      (3.4%)       35.90      (3.4%)    0.3% (  -6% -    7%) 0.771
                         TermDTSort      112.10      (3.2%)      112.45      (3.4%)    0.3% (  -6% -    7%) 0.763
               BrowseDateSSDVFacets        0.97      (7.1%)        0.98     (10.0%)    0.4% ( -15% -   18%) 0.897
          BrowseDayOfYearSSDVFacets        3.21      (2.2%)        3.22      (1.6%)    0.4% (  -3% -    4%) 0.525
              HighTermDayOfYearSort      235.24      (2.1%)      236.16      (1.6%)    0.4% (  -3% -    4%) 0.512
                          OrHighMed       70.60      (3.3%)       70.99      (2.7%)    0.5% (  -5% -    6%) 0.571
                         AndHighLow      370.31      (3.2%)      372.60      (3.4%)    0.6% (  -5% -    7%) 0.559
               HighTermTitleBDVSort        5.53      (4.1%)        5.56      (4.5%)    0.6% (  -7% -    9%) 0.648
                       OrHighNotLow      263.18      (5.6%)      264.95      (6.2%)    0.7% ( -10% -   13%) 0.717
                       OrHighNotMed      222.41      (5.8%)      224.06      (5.8%)    0.7% ( -10% -   13%) 0.688
                      OrHighNotHigh      233.04      (5.5%)      234.89      (5.8%)    0.8% (  -9% -   12%) 0.657
                          OrHighLow      463.17      (3.0%)      466.91      (3.1%)    0.8% (  -5% -    7%) 0.403
          BrowseDayOfYearTaxoFacets        3.77      (0.6%)        3.84      (8.9%)    1.9% (  -7% -   11%) 0.342
               BrowseDateTaxoFacets        3.74      (0.6%)        3.81      (8.8%)    1.9% (  -7% -   11%) 0.332
                  HighTermMonthSort     2350.73      (4.0%)     2477.32      (4.5%)    5.4% (  -2% -   14%) 0.000
                            Respell       30.81      (1.5%)       34.87      (1.7%)   13.2% (   9% -   16%) 0.000
                           PKLookup      141.03      (1.9%)      177.54      (2.0%)   25.9% (  21% -   30%) 0.000
   
   ```
   
   I tried to modify the bench task file and only run `WildCard` to understand where the time is spent.
   
   ### My version
   ![image](https://github.com/apache/lucene/assets/7917591/cf0589da-befd-4250-a4fb-6d2e4f65c0cf)
   
   ### mainline
   ![image](https://github.com/apache/lucene/assets/7917591/dd8b820d-4471-429f-909d-55b1d5ff35d9)
   
   
   So we can see that the most time is spent in actually reading out the FST arcs and FSA transitions... My intuitive explanation for why this is slower than the blocktree is that it has worse locality in its data access pattern. (@mikemccand maybe you can shed some light) Here are some relevant factors:
   * The FST is larger as it contains all terms. So there are more Arcs to visit. Blocktree (main) use the FST to index prefixes.
   * When binary-searching or directly address Arc/Transitions the target is somewhat random. 
   * The FST bytes are read backwards. (probably less of an issue if we read sequentially on modern HW)
   * Blocktree at a given node reads bytes sequentially and terms are sorted, too.
   
   
   Just out of curiosity I altered my code to load the FST on-heap to compare with the default off-heap option. It did not help much with `Wildcard` but PKLookup got substantially faster!
   
   The PKLookup task is a great proxy to FST performance, as both versions of the code visits the exact same number of Arcs. 
   
   ```
   
   Off heap
                           Wildcard       47.56      (1.7%)       10.13      (0.4%)  -78.7% ( -79% -  -77%) 0.000
                           PKLookup      136.03      (2.4%)      147.93      (2.3%)    8.8% (   3% -   13%) 0.000
   
   on heap
                           Wildcard       37.11      (1.5%)        8.35      (0.3%)  -77.5% ( -78% -  -76%) 0.000
                           PKLookup      136.04      (3.3%)      269.60      (9.0%)   98.2% (  83% -  114%) 0.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391564228


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+    Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+    IntsRefBuilder scratchInts = new IntsRefBuilder();
+    long ord = ++countPerType[termType.getId()];
+    fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+    return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+    // use a single long to encode `ord` and `termType`
+    // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+    // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+    // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+    if (ord < 0) {
+      throw new IllegalArgumentException("can't encode negative ord");
+    }
+    if (ord > MAX_ORD) {
+      throw new IllegalArgumentException(
+          "Input ord is too large for TermType: "

Review Comment:
   I see. That's a good point. I'll keep that in mind when implementing the main FieldsConsumer!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391027915


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+    Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+    IntsRefBuilder scratchInts = new IntsRefBuilder();
+    long ord = ++countPerType[termType.getId()];
+    fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+    return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+    // use a single long to encode `ord` and `termType`
+    // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+    // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+    // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+    if (ord < 0) {
+      throw new IllegalArgumentException("can't encode negative ord");
+    }
+    if (ord > MAX_ORD) {
+      throw new IllegalArgumentException(
+          "Input ord is too large for TermType: "

Review Comment:
   Well, maybe something like `Can only index XXX unique terms, but saw YYY terms for field ZZZ` or so?  (If that's what this error message really means to the user).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUT` value for the FST. so we peel off 4 bits in total.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381984403


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+  private final TermStateCodecComponent[] components;
+  private final int metadataBytesLength;
+
+  private static int getMetadataLength(TermStateCodecComponent component) {
+    // 1 byte for bitWidth; optionally 8 byte more for the reference value
+    return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+  }
+
+  public TermStateCodecImpl(TermStateCodecComponent[] components) {
+    assert components.length > 0;
+
+    this.components = components;
+    int metadataBytesLength = 0;
+    for (var component : components) {
+      metadataBytesLength += getMetadataLength(component);
+    }
+    this.metadataBytesLength = metadataBytesLength;
+  }
+
+  @Override
+  public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+    Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+    byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+    // Encode inputs via the bitpacker
+    for (var termState : inputs) {
+      encodeOne(bitPacker, termState, metadataPerComponent);
+    }
+
+    return metadataBytes;
+  }
+
+  private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+    Metadata[] metadataPerComponent = new Metadata[components.length];
+    for (int i = 0; i < components.length; i++) {
+      var component = components[i];
+      byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+      long referenceValue =
+          component.isMonotonicallyIncreasing() ? component.getTargetValue(inputs[0]) : 0L;
+      metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+    }
+    return metadataPerComponent;
+  }
+
+  private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+    byte[] metadataBytes = new byte[this.metadataBytesLength];
+    ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+    for (int i = 0; i < components.length; i++) {
+      var metadata = metadataPerComponent[i];
+      dataOut.writeByte(metadata.bitWidth);
+      if (components[i].isMonotonicallyIncreasing()) {
+        dataOut.writeLong(metadata.referenceValue);
+      }
+    }
+    return metadataBytes;
+  }
+
+  private void encodeOne(

Review Comment:
   Yes, since the common access pattern is to grab all interesting states for one term as oppose to "I want to get all posting start FP for many terms" 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438004582


##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static final long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler;
+
+  TermsIndexBuilder() throws IOException {
+    fstCompiler =

Review Comment:
   Just FYI, you can plug the `indexOutput` directly to the FSTCompiler (passing from RandomAccessTermsDictWriter), and make the FST writing entirely off-heap (apart from the heap, it also eliminates the time taken to write to the on-heap DataOutput). I have some example PRs at:
   - https://github.com/apache/lucene/pull/12985
   - https://github.com/apache/lucene/pull/12980



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382078005


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########


Review Comment:
   > I'm in the process of building the real compact bit packer.
   
   +1. I'm assuming that will be added to this PR?
   
   
   
   > The goal here is to pack a sequence of values that have different bitwidths... We can't use VLong either since we aim to write fixed size record, so that we can do random access
   
   Hmm.. I'll have to look deeper at this. The reason I ask is because I did a similar bit packing w/ "random access" when serializing the [ShapeDocValues binary tree](https://github.com/apache/lucene/blob/d6836d3d0e5d33a98b35c0885b9787f46c4be47e/lucene/core/src/java/org/apache/lucene/document/ShapeDocValues.java#L462C26-L462C26) and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391032959


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {

Review Comment:
   Hmm maybe don't name it `Codec` since it makes it seem like it could be a normal Lucene `Codec`?  But this is much lower level ... maybe `TermStateEncoder`?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {

Review Comment:
   Do we really need the separable interface and impl yet?  Are there more than one impl we are exploring?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/bitpacking/FixedSizeByteArrayBitPacker.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking;
+
+/**
+ * A {@link BitPacker} implementation that requires user to know the size of the resulting byte
+ * array upfront, in order to avoid allocation and copying for dynamically growing the array.
+ */
+public final class FixedSizeByteArrayBitPacker extends BitPackerImplBase {
+  private final byte[] bytes;
+  private int numBytesUsed = 0;

Review Comment:
   No need for `= 0` -- it's already java's default.



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TestTermStateCodecImpl.java:
##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPerBytePacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpackerImpl;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.FixedSizeByteArrayBitPacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.BytesRef;
+
+public class TestTermStateCodecImpl extends LuceneTestCase {
+
+  public void testEncodeDecode() {
+    TermStateCodecImpl codec =
+        new TermStateCodecImpl(
+            new TermStateCodecComponent[] {
+              TermStateCodecComponent.DocFreq.INSTANCE, TermStateCodecComponent.DocStartFP.INSTANCE,
+            });
+
+    ArrayList<IntBlockTermState> termStates = new ArrayList<>();
+    long maxDocFreqSeen = -1;
+    long docStartFPBase = random().nextLong(Long.MAX_VALUE >> 1);
+    long maxDocStartFPDeltaSeen = -1;
+    for (int i = 0; i < random().nextInt(2, 256); i++) {
+      var termState = new IntBlockTermState();
+      termState.docFreq = random().nextInt(1, 1 << random().nextInt(1, 31));
+      if (i == 0) {
+        termState.docStartFP = docStartFPBase;
+      } else {
+        termState.docStartFP = termStates.get(i - 1).docStartFP + random().nextLong(1024);
+        maxDocStartFPDeltaSeen =
+            Math.max(maxDocStartFPDeltaSeen, termState.docStartFP - docStartFPBase);
+      }
+      maxDocFreqSeen = Math.max(maxDocFreqSeen, termState.docFreq);
+      termStates.add(termState);
+    }
+
+    IntBlockTermState[] termStatesArray = termStates.toArray(IntBlockTermState[]::new);
+
+    BitPerBytePacker bitPerBytePacker = new BitPerBytePacker();
+    byte[] metadata = codec.encodeBlock(termStatesArray, bitPerBytePacker);
+
+    // For the metadata, we expect
+    // 0: DocFreq.bitWidth,
+    // 1: DocStartFP.bitWidth,
+    // [2-10]: DocStartFP.referenceValue;
+    int expectedDocFreqBitWidth = 64 - Long.numberOfLeadingZeros(maxDocFreqSeen);
+    int expectedDocStartFPBitWidth = 64 - Long.numberOfLeadingZeros(maxDocStartFPDeltaSeen);
+    assertEquals(10, metadata.length);
+    assertEquals(expectedDocFreqBitWidth, metadata[0]);
+    assertEquals(expectedDocStartFPBitWidth, metadata[1]);
+    ByteArrayDataInput byteArrayDataInput = new ByteArrayDataInput(metadata, 2, 8);
+    assertEquals(docStartFPBase, byteArrayDataInput.readLong());
+
+    // Assert with real bit-packer we get the same bytes
+    FixedSizeByteArrayBitPacker fixedSizeByteArrayBitPacker =
+        new FixedSizeByteArrayBitPacker(bitPerBytePacker.getCompactBytes().length);
+    codec.encodeBlock(termStatesArray, fixedSizeByteArrayBitPacker);
+    assertArrayEquals(bitPerBytePacker.getCompactBytes(), fixedSizeByteArrayBitPacker.getBytes());
+
+    // Assert that each term state is the same after the encode-decode roundtrip.
+    BytesRef metadataBytes = new BytesRef(metadata);
+    BytesRef dataBytes = new BytesRef(bitPerBytePacker.getBytes());
+    assertBlockRoundTrip(termStatesArray, codec, metadataBytes, dataBytes, bitPerBytePacker);
+
+    // With real compact bits instead of bit-per-byte
+    dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes());
+    assertBlockRoundTrip(
+        termStatesArray, codec, metadataBytes, dataBytes, BitUnpackerImpl.INSTANCE);
+
+    // Also test decoding that doesn't begin at the start of the block.
+    int pos = random().nextInt(termStatesArray.length);
+    int startBitIndex = pos > 0 ? random().nextInt(pos) : 0;
+    int recordSize = expectedDocFreqBitWidth + expectedDocStartFPBitWidth;
+    // With bit-per-byte bytes
+    dataBytes =
+        new BytesRef(bitPerBytePacker.getBytes(), pos * recordSize - startBitIndex, recordSize);
+    assertDecodeAt(
+        codec, metadataBytes, dataBytes, bitPerBytePacker, startBitIndex, termStatesArray[pos]);
+
+    // With compact bytes
+    int startByteIndex = pos * recordSize / 8;
+    int endByteIndex = (pos + 1) * recordSize / 8;
+    int length = endByteIndex - startByteIndex + ((pos + 1) * recordSize % 8 == 0 ? 0 : 1);
+    dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes(), startByteIndex, length);
+    assertDecodeAt(
+        codec,
+        metadataBytes,
+        dataBytes,
+        BitUnpackerImpl.INSTANCE,
+        (pos * recordSize) % 8,
+        termStatesArray[pos]);
+  }
+
+  private static void assertDecodeAt(
+      TermStateCodecImpl codec,
+      BytesRef metadataBytes,
+      BytesRef dataBytes,
+      BitUnpacker bitUnpacker,
+      int startBitIndex,
+      IntBlockTermState termState) {
+    IntBlockTermState decoded =
+        codec.decodeAt(metadataBytes, dataBytes, bitUnpacker, startBitIndex);
+    assertEquals(termState.docFreq, decoded.docFreq);
+    assertEquals(termState.docStartFP, decoded.docStartFP);
+  }
+
+  private static void assertBlockRoundTrip(
+      IntBlockTermState[] termStatesArray,
+      TermStateCodecImpl codec,
+      BytesRef metadataBytes,
+      BytesRef dataBytes,
+      BitUnpacker bitUnpacker) {
+    for (int i = 0; i < termStatesArray.length; i++) {
+      IntBlockTermState decoded = codec.decodeWithinBlock(metadataBytes, dataBytes, bitUnpacker, i);
+      assertEquals(termStatesArray[i].docFreq, decoded.docFreq);
+      assertEquals(termStatesArray[i].docStartFP, decoded.docStartFP);
+    }
+  }
+
+  public void testGetCodec() {
+    for (IndexOptions indexOptions : IndexOptions.values()) {
+      if (indexOptions == IndexOptions.NONE) {
+        continue;
+      }
+      for (int i = 0; i < 8; i++) {
+        if ((i & 0b011) == 0b011) {
+          continue;
+        }
+        if ((i & 0b100) == 0b100
+            && indexOptions.ordinal() < IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          continue;
+        }
+        TermType termType = TermType.fromId(i);
+        var expected = getExpectedCodec(termType, indexOptions);
+        var got = TermStateCodecImpl.getCodec(termType, indexOptions);
+        assertEquals(expected, got);
+      }
+    }
+  }
+
+  // Enumerate the expected Codec we get for (TermType, IndexOptions) pairs.
+  static TermStateCodecImpl getExpectedCodec(TermType termType, IndexOptions indexOptions) {
+    ArrayList<TermStateCodecComponent> components = new ArrayList<>();
+    // Wish I can code this better in java...
+    switch (termType.getId()) {
+        // Not singleton doc; No skip data; No last position block offset
+      case 0b000 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc; No skip data; No last position block offset
+      case 0b001 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+        // If field needs frequency, we need totalTermsFreq.
+        // Since there is only 1 doc, totalTermsFreq == docFreq.
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+
+        // Not Singleton doc; Has skip data; No last position block offset
+      case 0b010 -> {
+        assert !termType.hasLastPositionBlockOffset()
+            && termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.SkipOffset.INSTANCE);
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+          components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        }
+        if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+          components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+          components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        }
+        if (indexOptions.ordinal()
+            >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc but has skip data; Invalid state.
+      case 0b011, 0b111 -> {
+        assert termType.hasSkipData() && termType.hasSingletonDoc();
+        throw new IllegalStateException(
+            "Unreachable. A term has skip data but also only has one doc!? Must be a bug");
+      }
+        // Not singleton doc; No skip data; Has last position block offset;
+      case 0b100 -> {
+        assert termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && !termType.hasSingletonDoc();
+        assert indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+        components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+        components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+        if (indexOptions.ordinal()
+            >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Singleton doc; No skip data; Has last position block offset;
+      case 0b101 -> {
+        assert termType.hasLastPositionBlockOffset()
+            && !termType.hasSkipData()
+            && termType.hasSingletonDoc();
+        assert indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+        components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+        components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+        components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+        components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+        if (indexOptions.ordinal()
+            >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+          components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+        }
+      }
+        // Not singleton doc; Has skip data; Has last position block offset;
+      case 0b110 -> {

Review Comment:
   Oooh I like the binary-valued case switches!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1860623376

   Theres are very interesting results @Tony-X!  I'll try to give deeper response soon, but one idea that jumped out about `Wildcard` is that BlockTree somewhere takes advantage of `commonSuffixBytes` or so?   This is a `BytesRef` that is non-empty when all strings matched by the `Automaton` share some common suffix, as would happen with a wildcard query like `ab*cd` (`cd` would be the common suffix).
   
   Maybe block tree is able to use this opto more effectively than the "all terms in FST" approach?  But I think you could implement such an opto too, maybe: just find the one node (or, maybe more than one) whose suffix is the common suffix, and fast-check somehow?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381990655


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.
+ */
+final class TermType {
+  private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+  private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+  private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+  public static final int NUM_TOTAL_TYPES = 8;

Review Comment:
   Hmm, 8 because even for the most demanding  indexing options which asks for everything: doc, freq, position, payload and offset, there are three optional states singleton docid, skip offset and the last VINT position block offset.
   
   For lighter indexing options, it is possible that the terms produced will have less than 8 types. For example, if position is not needed then we know we never have last VINT position block offset. So there can only be up to 4 types. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUT` value for the FST. 



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+    Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+    IntsRefBuilder scratchInts = new IntsRefBuilder();
+    long ord = ++countPerType[termType.getId()];
+    fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+    return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+    // use a single long to encode `ord` and `termType`
+    // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+    // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+    // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+    if (ord < 0) {
+      throw new IllegalArgumentException("can't encode negative ord");
+    }
+    if (ord > MAX_ORD) {
+      throw new IllegalArgumentException(
+          "Input ord is too large for TermType: "

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1770448647

   I'll try to review this soon -- it sounds compelling @Tony-X!  I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM).
   
   Also, with the upcoming ability to [cleanly limit how much RAM the `FSTCompiler` is allowed to use to reduce the size of the FST](https://github.com/apache/lucene/pull/12633), this approach becomes more feasible.  Without that change, the FST compilation might easily use excessive RAM during indexing when merging large segments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1830166075

   > This is reasonable as the terms index (FST) holds all the terms.
   
   +1, nice!
   
   > #### Fuzzy/Wildcard/Prefix queries got _much slower_
   > This is also expected because currently I used the default implementation provided by `TermsEnum` which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.
   
   OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of `seekCeil`) `.intersect` `TermsEnum` is.  Once you have an optimized version it should likely be faster than block tree since it can intersect all suffixes instead of scanning `byte[]` suffixes in the term block and re-testing each.
   
   > #### `HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput
   > I don't quite understand why term lookup could affect sorting on a DV field
   
   This is odd.  Though, the `HighTermMonthSort` QPS is so crazy high as to not really be trustworthy -- likely BMW is kicking in and saving tons of work.
   
   > #### `AndHighLow` got slower
   >
   > Am i missing some optimization opportunity for low freq terms?
   
   Hmm maybe pulsing?  Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1373799812


##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
   exports org.apache.lucene.codecs.lucene95;
   exports org.apache.lucene.codecs.lucene90.blocktree;
   exports org.apache.lucene.codecs.lucene90.compressing;
+  exports org.apache.lucene.codecs.lucene90.radomaccess;

Review Comment:
   good idea..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381580022


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.

Review Comment:
   `term as a` -> `term has a`?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+  private final TermStateCodecComponent[] components;
+  private final int metadataBytesLength;
+
+  private static int getMetadataLength(TermStateCodecComponent component) {
+    // 1 byte for bitWidth; optionally 8 byte more for the reference value
+    return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+  }
+
+  public TermStateCodecImpl(TermStateCodecComponent[] components) {
+    assert components.length > 0;
+
+    this.components = components;
+    int metadataBytesLength = 0;
+    for (var component : components) {
+      metadataBytesLength += getMetadataLength(component);
+    }
+    this.metadataBytesLength = metadataBytesLength;
+  }
+
+  @Override
+  public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+    Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+    byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+    // Encode inputs via the bitpacker
+    for (var termState : inputs) {
+      encodeOne(bitPacker, termState, metadataPerComponent);
+    }
+
+    return metadataBytes;
+  }
+
+  private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+    Metadata[] metadataPerComponent = new Metadata[components.length];
+    for (int i = 0; i < components.length; i++) {
+      var component = components[i];
+      byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+      long referenceValue =
+          component.isMonotonicallyIncreasing() ? component.getTargetValue(inputs[0]) : 0L;
+      metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+    }
+    return metadataPerComponent;
+  }
+
+  private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+    byte[] metadataBytes = new byte[this.metadataBytesLength];
+    ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+    for (int i = 0; i < components.length; i++) {
+      var metadata = metadataPerComponent[i];
+      dataOut.writeByte(metadata.bitWidth);
+      if (components[i].isMonotonicallyIncreasing()) {
+        dataOut.writeLong(metadata.referenceValue);
+      }
+    }
+    return metadataBytes;
+  }
+
+  private void encodeOne(

Review Comment:
   OK it looks like it is not column stride -- each term is packing all of its metadata into one `byte[]` and then these `byte[]` are concatenated for all the terms.



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/Lucene90RandomAccessDictionaryPostingsFormat.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.FieldsConsumer;
+import org.apache.lucene.codecs.FieldsProducer;
+import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import org.apache.lucene.codecs.PostingsWriterBase;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsReader;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Similar to {@link Lucene90PostingsFormat} but with a different term dictionary implementation.

Review Comment:
   It would be nice to spell out some high level details about how the format works :)  E.g. single FST holding full terms, with `long` pointer (packed term type + up to 62 bit ordinal) to where full term metadata is packed on-disk.  And the 8 different term type "groupings".
   
   BTW I think FST will pack better if you put the 3 "term type" bits at the end (lsb) of the `long` instead of msb, because then prefixes can be shared across all term types, instead of just within each term type.  But, this may not be a win because it may take more vLong bytes to write each delta, not sure :)



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.
+ */
+final class TermType {
+  private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+  private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+  private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+  public static final int NUM_TOTAL_TYPES = 8;

Review Comment:
   8 because this is all the combinations/permutations of Lucene's different indexing options?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecComponent.java:
##########
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+abstract class TermStateCodecComponent {
+
+  static byte getBitWidth(IntBlockTermState[] termStates, TermStateCodecComponent component) {
+    assert termStates.length > 0;
+
+    long maxValSeen = -1;
+    long referenceValue =
+        component.isMonotonicallyIncreasing() ? component.getTargetValue(termStates[0]) : 0;
+
+    for (var termState : termStates) {
+      maxValSeen = Math.max(maxValSeen, component.getTargetValue(termState) - referenceValue);
+    }
+    return (byte) (64 - Long.numberOfLeadingZeros(maxValSeen));
+  }
+
+  abstract boolean isMonotonicallyIncreasing();
+
+  abstract long getTargetValue(IntBlockTermState termState);
+
+  abstract void setTargetValue(IntBlockTermState termState, long value);
+
+  /** Below are the relevant IntBlockTermState components * */
+  static final class SingletonDocId extends TermStateCodecComponent {
+    public static SingletonDocId INSTANCE = new SingletonDocId();
+
+    private SingletonDocId() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return false;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.singletonDocID;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      assert value <= Integer.MAX_VALUE;
+      // A correct codec implementation does not change the value,
+      // after the encode/decode round-trip it should still be a valid int
+      termState.singletonDocID = (int) value;
+    }
+  }
+
+  static final class DocFreq extends TermStateCodecComponent {
+    public static DocFreq INSTANCE = new DocFreq();
+
+    private DocFreq() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return false;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.docFreq;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      assert value <= Integer.MAX_VALUE;
+      // A correct codec implementation does not change the value,
+      // after the encode/decode round-trip it should still be a valid int
+      termState.docFreq = (int) value;
+    }
+  }
+
+  static final class TotalTermFreq extends TermStateCodecComponent {
+    public static TotalTermFreq INSTANCE = new TotalTermFreq();
+
+    private TotalTermFreq() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return false;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.totalTermFreq;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      termState.totalTermFreq = value;
+    }
+  }
+
+  static final class DocStartFP extends TermStateCodecComponent {
+    public static DocStartFP INSTANCE = new DocStartFP();
+
+    private DocStartFP() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return true;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.docStartFP;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      termState.docStartFP = value;
+    }
+  }
+
+  static final class PositionStartFP extends TermStateCodecComponent {
+    public static PositionStartFP INSTANCE = new PositionStartFP();
+
+    private PositionStartFP() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return true;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.posStartFP;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      termState.posStartFP = value;
+    }
+  }
+
+  static final class PayloadStartFP extends TermStateCodecComponent {
+    public static PayloadStartFP INSTANCE = new PayloadStartFP();
+
+    private PayloadStartFP() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return true;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.payStartFP;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      termState.payStartFP = value;
+    }
+  }
+
+  static final class SkipOffset extends TermStateCodecComponent {
+    public static SkipOffset INSTANCE = new SkipOffset();
+
+    private SkipOffset() {}
+
+    @Override
+    public boolean isMonotonicallyIncreasing() {
+      return false;
+    }
+
+    @Override
+    public long getTargetValue(IntBlockTermState termState) {
+      return termState.skipOffset;
+    }
+
+    @Override
+    public void setTargetValue(IntBlockTermState termState, long value) {
+      termState.skipOffset = value;
+    }
+  }
+
+  static final class LastPositionBlockOffset extends TermStateCodecComponent {

Review Comment:
   Are these per-term metadata stored column-stride in block of terms?  I.e. all `LastPositionBlockOffset` for N terms in one block are encoded as a contiguous `byte[]`?  (And same for `skipOffset`, `totalTermFreq`, etc.)?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Because we peel off the three high bits of the `long` to encode the term type?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+    Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+    IntsRefBuilder scratchInts = new IntsRefBuilder();
+    long ord = ++countPerType[termType.getId()];
+    fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+    return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+    // use a single long to encode `ord` and `termType`
+    // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+    // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+    // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+    if (ord < 0) {
+      throw new IllegalArgumentException("can't encode negative ord");
+    }
+    if (ord > MAX_ORD) {
+      throw new IllegalArgumentException(
+          "Input ord is too large for TermType: "

Review Comment:
   More user friendly message here too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391575474


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {

Review Comment:
   Probably not --  I started off with an interface (perhaps because i was influenced by developing with Rust) since it has an emphasis on what exactly are needed. Also it makes mocks easier. (Though I didn't need it)
   
   Let's revisit this later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438004582


##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static final long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler;
+
+  TermsIndexBuilder() throws IOException {
+    fstCompiler =

Review Comment:
   Just FYI, you can plug the `indexOutput` directly to the FSTCompiler (passing from RandomAccessTermsDictWriter), and make the FST writing entirely off-heap. I have some example PRs at:
   - https://github.com/apache/lucene/pull/12985
   - https://github.com/apache/lucene/pull/12980



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1860625197

   The `PKLookup` gains are astounding!
   
   Especially interesting is the off -> on heap gains for that task.  We are somehow paying a high price for going through Lucene's IO APIs instead of `byte[]` backed `ByteBuffer`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857380213

   ## Non-trivial amount of allocations for?  .... building IndexInput slice descriptions !?
   
   `jdk.internal.misc.Unsafe#allocateUninitializedArray()`. This was not trivial to find out why. But again with the raw JFR report, we can analyze the call tree. It turn out that in the [`buildSlice()` implementation](https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L451) of `MemorySegmentIndexInput` we call [`IndexInput#getFullSliceDescription()`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/IndexInput.java#L128) which creates new String. And `allocateUninitializedArray` is called to allocate the bytes for the String. 
   
   AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it'd cause that much of allocation. So I made a change to pass null when building the description so those allocations are gone. 
   
   
   ```
   Before 
   
   PERCENT       HEAP SAMPLES  STACK
   10.39%        12103M        java.lang.Long#valueOf()
   9.91%         11543M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   8.91%         10383M        jdk.internal.misc.Unsafe#allocateUninitializedArray()
   
   After
   PERCENT       HEAP SAMPLES  STACK                                                                                                                              [37/1812]
   25.97%        32791M        java.lang.Long#valueOf()
   7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1862184623

   Thanks @mikemccand for taking a look! I see the [getCommonSuffixBytesRef](https://github.com/apache/lucene/blob/5d6086e1994d766a3dd39a47b14a8cd80a7280e6/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L1161) method from `Automaton`.
   
   I wonder if it is really applicable to the FST... i.e. for the FST is it guaranteed that there exists one and only one state where all valid outputs path that share the same suffix go through?  Or put it in other words, how many sub-graphs of the FST are there that represents the same suffix? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382429884


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########


Review Comment:
   > Since they are of the same size...
   
   That's the difference. In your use case the records (blocks) are guaranteed to be the same size where as in the serialized tree case the records (tree nodes) are not guaranteed to be the same size. This is by design to ensure the resulting docvalue disk consumption is as efficient (small) as possible. 
   
   
   
   
   
   > ...by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways?
   
   Yes to variable length encoding. The "random-ness" isn't purely random in that traversal of the serialized tree is DFS. Because the tree nodes are variable size the serialized array includes copious "book-keeping" in the form of "sizeOf" values. At DFS traversal the first "sizeOf" value provides the size of the entire left tree. To prune the left tree just means we skip that many bytes to get to the right tree.. this continues recursively. In practice we don't expect to ever "back up" in our DFS traversal so there is only a `rewind` method that simply resets the offset values to 0. 
   
   
   Seems the two use cases are subtly different but I could see roughly 80% overlap in the implementation. I'd love to efficiently encapsulate this logic for the next contributor that wants a random serialized traversal mechanism without a ridiculous amount of java object overhead. Sounds like @bruno-roustant had the same need? Could be a good follow on progress PR. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUTT` value for the FST. so we peel off 4 bits in total.



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUTPUT` value for the FST. so we peel off 4 bits in total.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1796863568

   Just realized that we have lucene99 Codec out! I'll update the code to reflect that as this posting format aims to work with the latest Codec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925

   After some tweaking and tinkering I was finally able to get the bench running the way I wanted in luceneutil! @mikemccand 
   
   Unfortunately luceneutil out of the box doesn't work for my case...
   
   
   ```
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value                                    [19/1802]
                           Wildcard       56.20      (2.0%)        7.30      (0.2%)  -87.0% ( -87% -  -86%) 0.000
                            Respell       44.00      (1.7%)       14.46      (0.6%)  -67.1% ( -68% -  -65%) 0.000
                             Fuzzy1       54.11      (1.2%)       20.86      (0.8%)  -61.4% ( -62% -  -60%) 0.000
                            Prefix3      103.61      (0.8%)       41.37      (0.5%)  -60.1% ( -60% -  -59%) 0.000
                             Fuzzy2       42.43      (1.2%)       20.26      (0.9%)  -52.3% ( -53% -  -50%) 0.000
                  HighTermTitleSort      127.60      (1.7%)      114.65      (1.6%)  -10.1% ( -13% -   -6%) 0.000
                  HighTermMonthSort     2549.22      (2.8%)     2435.74      (3.8%)   -4.5% ( -10% -    2%) 0.037
                         AndHighLow      708.99      (2.3%)      678.15      (2.1%)   -4.4% (  -8% -    0%) 0.001
                            LowTerm      369.16      (5.5%)      358.15      (3.1%)   -3.0% ( -10% -    5%) 0.287
                       OrNotHighLow      258.51      (1.8%)      252.11      (2.2%)   -2.5% (  -6% -    1%) 0.054
                          OrHighLow      348.58      (1.1%)      340.11      (2.5%)   -2.4% (  -5% -    1%) 0.046
                      OrNotHighHigh      141.31      (6.7%)      138.54      (3.2%)   -2.0% ( -11% -    8%) 0.554
                             IntNRQ       17.93      (1.8%)       17.62      (3.7%)   -1.7% (  -7% -    3%) 0.349
                    MedSloppyPhrase       26.73      (0.8%)       26.28      (1.3%)   -1.7% (  -3% -    0%) 0.017
               HighIntervalsOrdered        3.88      (2.6%)        3.82      (2.3%)   -1.6% (  -6% -    3%) 0.314
                           HighTerm      290.77      (7.3%)      286.20      (6.9%)   -1.6% ( -14% -   13%) 0.727
                      OrHighNotHigh      296.51      (6.9%)      291.90      (4.9%)   -1.6% ( -12% -   11%) 0.682
                LowIntervalsOrdered       15.48      (1.1%)       15.26      (1.3%)   -1.4% (  -3% -    0%) 0.057
                            MedTerm      405.54      (6.3%)      399.77      (5.9%)   -1.4% ( -12% -   11%) 0.713
                    LowSloppyPhrase       44.64      (0.5%)       44.04      (2.1%)   -1.3% (  -3% -    1%) 0.171
              HighTermDayOfYearSort      177.36      (1.6%)      175.08      (1.8%)   -1.3% (  -4% -    2%) 0.234
                          OrHighMed       79.44      (3.7%)       78.48      (3.0%)   -1.2% (  -7% -    5%) 0.572
                       OrNotHighMed      268.82      (4.3%)      265.74      (3.1%)   -1.1% (  -8% -    6%) 0.632
              BrowseMonthTaxoFacets        3.89      (0.4%)        3.85      (1.2%)   -0.9% (  -2% -    0%) 0.134
            AndHighMedDayTaxoFacets       44.33      (0.7%)       43.96      (1.1%)   -0.8% (  -2% -    0%) 0.145
                        MedSpanNear       30.67      (1.1%)       30.42      (2.3%)   -0.8% (  -4% -    2%) 0.481
                        LowSpanNear        4.56      (0.8%)        4.53      (2.1%)   -0.6% (  -3% -    2%) 0.576
                       HighSpanNear        8.52      (1.4%)        8.47      (2.1%)   -0.5% (  -3% -    2%) 0.636
                       OrHighNotLow      236.32      (6.6%)      235.28      (4.4%)   -0.4% ( -10% -   11%) 0.901
           AndHighHighDayTaxoFacets        4.31      (0.6%)        4.29      (0.8%)   -0.4% (  -1% -    1%) 0.446
                          LowPhrase       69.90      (1.2%)       69.67      (2.7%)   -0.3% (  -4% -    3%) 0.807
                         AndHighMed       41.87      (0.6%)       41.75      (2.7%)   -0.3% (  -3% -    3%) 0.828
                         OrHighHigh       45.86      (7.4%)       45.77      (7.9%)   -0.2% ( -14% -   16%) 0.969
                         TermDTSort      101.90      (2.3%)      101.84      (2.0%)   -0.1% (  -4% -    4%) 0.966
                   HighSloppyPhrase        0.48      (2.2%)        0.48      (2.6%)   -0.0% (  -4% -    4%) 0.995
               BrowseDateSSDVFacets        1.00      (3.7%)        1.00      (5.3%)   -0.0% (  -8% -    9%) 1.000
               HighTermTitleBDVSort        4.48      (5.4%)        4.49      (4.4%)    0.0% (  -9% -   10%) 0.990
          BrowseDayOfYearSSDVFacets        3.51     (13.4%)        3.52     (13.3%)    0.1% ( -23% -   30%) 0.990
              BrowseMonthSSDVFacets        3.38      (0.7%)        3.39      (0.6%)    0.1% (  -1% -    1%) 0.740
          BrowseDayOfYearTaxoFacets        3.85      (0.3%)        3.85      (0.5%)    0.2% (   0% -    0%) 0.469
                MedIntervalsOrdered        1.65      (1.5%)        1.65      (1.5%)    0.2% (  -2% -    3%) 0.802
               BrowseDateTaxoFacets        3.82      (0.3%)        3.83      (0.4%)    0.3% (   0% -    0%) 0.264
        BrowseRandomLabelSSDVFacets        2.34      (1.0%)        2.35      (0.5%)    0.3% (  -1% -    1%) 0.494
                       OrHighNotMed      296.34      (7.6%)      297.38      (4.6%)    0.4% ( -11% -   13%) 0.930
        BrowseRandomLabelTaxoFacets        3.29      (0.6%)        3.31      (0.7%)    0.5% (   0% -    1%) 0.275
                          MedPhrase        9.84      (2.1%)        9.90      (4.1%)    0.6% (  -5% -    6%) 0.787
                        AndHighHigh       33.50      (1.2%)       33.73      (3.4%)    0.7% (  -3% -    5%) 0.669
                         HighPhrase       15.15      (2.3%)       15.28      (4.1%)    0.9% (  -5% -    7%) 0.679
               MedTermDayTaxoFacets       13.36      (1.8%)       13.55      (2.0%)    1.4% (  -2% -    5%) 0.240
             OrHighMedDayTaxoFacets        3.73      (1.4%)        3.83      (2.2%)    2.6% (   0% -    6%) 0.026
                           PKLookup      147.84      (1.6%)      157.37      (1.5%)    6.4% (   3% -    9%) 0.000
   ```
   
   ### Observations:
   #### PKLookup has improvement 
   This is reasonable as the terms index (FST) holds all the terms.
   
   #### Fuzzy/Wildcard/Prefix queries got *much slower* 
   This is also expected because currently I used the default implementation provided by `TermsEnum` which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes. 
   
   #### `HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput 
   I don't quite understand why term lookup could
   
   
   #### `AndHighLow` got slower
   Am i missing some optimization opportunity for low freq terms?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381994228


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+    Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+    IntsRefBuilder scratchInts = new IntsRefBuilder();
+    long ord = ++countPerType[termType.getId()];
+    fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+    return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+    // use a single long to encode `ord` and `termType`
+    // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+    // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+    // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+    if (ord < 0) {
+      throw new IllegalArgumentException("can't encode negative ord");
+    }
+    if (ord > MAX_ORD) {
+      throw new IllegalArgumentException(
+          "Input ord is too large for TermType: "

Review Comment:
   Did you mean to spell out the input ord? 



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/Lucene90RandomAccessDictionaryPostingsFormat.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.FieldsConsumer;
+import org.apache.lucene.codecs.FieldsProducer;
+import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import org.apache.lucene.codecs.PostingsWriterBase;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsReader;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Similar to {@link Lucene90PostingsFormat} but with a different term dictionary implementation.

Review Comment:
   > BTW I think FST will pack better if you put the 3 "term type" bits at the end (lsb) of the long instead of msb, because then prefixes can be shared across all term types, instead of just within each term type. But, this may not be a win because it may take more vLong bytes to write each delta, not sure :)
   
   I believe this is what it is currently doing :) it will not create full 64bit long this way. To be specific -- 
   * bit 0: indicate it is not `NO_OUTPUT` always set to 1 for valid (termType, ord)
   * bit [1, 4): termType
   * bit [4, 64): ord value
   
   bit index is from low to high, bit 0 is least significant bit.
   
   https://github.com/Tony-X/lucene/blob/ramdon_access_term_dict/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java#L68
   
   [code](https://github.com/Tony-X/lucene/blob/ramdon_access_term_dict/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java#L68)
   ```java
   return (ord << 4) | ((long) termType.getId() << 1) | 1L;
   ```



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########


Review Comment:
   Hey Knize, thanks for reviewing. 
   
   This is not test-only. It is used by the `TestTermStateCodecImpl`. I'm in the process of building the real compact bit packer. 
   
   > I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?
   
   I did search through the code base and couldn't find something I can use. The goal here is to pack a sequence of values that have different bitwidths . We can't use `PackedInts` as it requires values to have same bitwidth. We can't use VLong either since we aim to write fixed size record, so that we can do random access.
   
   More detailed discussion can be found in this email thread: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:packedInts



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "bruno-roustant (via GitHub)" <gi...@apache.org>.
bruno-roustant commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1772589968

   I'll also try to review!
   On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1890169118

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1792903331

   > It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=<name> and Lucene will run all tests cases using your Codec.
   
   Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382113535


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########


Review Comment:
   > +1. I'm assuming that will be added to this PR?
   
   Yes! This PR can be large so I took the advice from @mikemccand  to open it early to avoid massive diff in one shot. Goal of the PR is to have a fully functional PostingFormat (or Codec). 
   
   > The reason I ask is because I did a similar bit packing w/ "random access" when serializing the [ShapeDocValues binary tree](https://github.com/apache/lucene/blob/d6836d3d0e5d33a98b35c0885b9787f46c4be47e/lucene/core/src/java/org/apache/lucene/document/ShapeDocValues.java#L462C26-L462C26) and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?
   
   Not sure if this code does the same thing. I could be wrong, but by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways? 
   
   Here in this PR, the use case is -- I have a bunch of bits in the form of byte[] that represents a block records that have same size (measure in bits, but size can be > 64 so we can't use PackedInts). Since they are of the same size, we can randomly access any record with an index and read the bits at [index * size, (index+1) * size] 
   
   I do agree that we should seek opportunities to unify. But for now since this is under sandbox, I'll make it specific to this implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1380840508


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########


Review Comment:
   Looks like this is only used by tests? Maybe move to the tests package? I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1808041346

   > Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?
   
   You don't really need a new `Codec` -- you could test this easily by subclassing `Lucene99Codec` and overriding `getPostingsFormatForField` to use your new cool bit packing `PostingsFormat`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "bruno-roustant (via GitHub)" <gi...@apache.org>.
bruno-roustant commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773923204

   This is some code I wrote a long time ago. It has been tested and used, so
   I'm confident on the functional aspect, and it might benefit from a
   benchmark for perf.
   
   Le ven. 20 oct. 2023 à 19:20, Tony-X ***@***.***> a écrit :
   
   > Thanks @bruno-roustant <https://github.com/bruno-roustant> ! If you're
   > okay to share it feel free to share it here.
   >
   > I'm in the process of baking my own specific implementation (which
   > internally uses a single long as bit buffer), but I might absorb some
   > interesting ideas from your impl.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/lucene/pull/12688#issuecomment-1773113866>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AIC45DG57A67JAS2CWX4SDLYAKXF5AVCNFSM6AAAAAA6C3VWVGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZTGEYTGOBWGY>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1365702707


##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
   exports org.apache.lucene.codecs.lucene95;
   exports org.apache.lucene.codecs.lucene90.blocktree;
   exports org.apache.lucene.codecs.lucene90.compressing;
+  exports org.apache.lucene.codecs.lucene90.radomaccess;

Review Comment:
   Let's put this (experimental) Codec in `sandbox` to begin with?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermType.java:
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+import java.util.Objects;
+
+class TermType {
+    private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+    private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+    private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+    public static final int NUM_TOTAL_TYPES = 8;

Review Comment:
   Maybe add a comment explaining what a "type" is in this context?  Is it all permutations of those three flags?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the ordinals
+ * are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+    private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+    private final FSTCompiler<Long> fstCompiler =
+            new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+    TermsIndexBuilder() {
+        Arrays.fill(countPerType, -1);
+    }
+
+    public void addTerm(BytesRef term, TermType termType) throws IOException {
+        IntsRefBuilder scratchInts = new IntsRefBuilder();
+        long ord = ++countPerType[termType.getId()];
+        fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+    }
+
+    public TermsIndex build() throws IOException {
+        return new TermsIndex(fstCompiler.compile());
+    }
+
+    private long encode(long ord, TermType termType) {
+        // use a single long to encode `ord` and `termType`
+        // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+        // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+        // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+        if ( ord < 0) {
+            throw new IllegalArgumentException("can't encode negative ord");
+        }
+        if ( ord > ((1L << 60) - 1) ) {
+            throw new IllegalArgumentException("Input ord is too large");

Review Comment:
   Maybe a more user-friendly error, like "too many terms of type X" or so?  And maybe include the limit (`1L << 60`) in the exception message?



##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
   exports org.apache.lucene.codecs.lucene95;
   exports org.apache.lucene.codecs.lucene90.blocktree;
   exports org.apache.lucene.codecs.lucene90.compressing;
+  exports org.apache.lucene.codecs.lucene90.radomaccess;

Review Comment:
   Also, `radomaccess` is missing an `n` --> `randomaccess`.



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndex.java:
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.fst.FST;
+
+
+record TermsIndex(FST<Long> fst) {

Review Comment:
   Ooooh a `record`!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1373800366


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndex.java:
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.fst.FST;
+
+
+record TermsIndex(FST<Long> fst) {

Review Comment:
   yup! you get equals() and hashCode() for free!



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the ordinals
+ * are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+    private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+    private final FSTCompiler<Long> fstCompiler =
+            new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+    TermsIndexBuilder() {
+        Arrays.fill(countPerType, -1);
+    }
+
+    public void addTerm(BytesRef term, TermType termType) throws IOException {
+        IntsRefBuilder scratchInts = new IntsRefBuilder();
+        long ord = ++countPerType[termType.getId()];
+        fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+    }
+
+    public TermsIndex build() throws IOException {
+        return new TermsIndex(fstCompiler.compile());
+    }
+
+    private long encode(long ord, TermType termType) {
+        // use a single long to encode `ord` and `termType`
+        // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+        // so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+        // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+        if ( ord < 0) {
+            throw new IllegalArgumentException("can't encode negative ord");
+        }
+        if ( ord > ((1L << 60) - 1) ) {
+            throw new IllegalArgumentException("Input ord is too large");

Review Comment:
   Yeah, makes sense. your suggestion would make the error more usable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438014984


##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static final long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler;
+
+  TermsIndexBuilder() throws IOException {
+    fstCompiler =

Review Comment:
   Note that the FST metadata and data cannot be written to the same file, but it seems you already separated `metaOutput` and `indexOutput` so that should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391566961


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {

Review Comment:
   I know.... I started off calling it `TermStateEncoder` but I soon realized it needs decoding, too. 
   
   I'm open to change its names :) 
   
   One good thing is that this is totally implementation detail which needs not to be exposed outside this package.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Random access term dictionary [lucene]

Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857371557

   Since the first working version, I iterated with a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil !). 
   
   Some of them comes from my code that implements the term dictionary data lookup, and a few of them are at more general Lucene level. I want to highlight the general issue I see from this work and maybe we can have separate issues to improve them!
   
   Here is the first heap profile comparison (search-only, no indexing). 
   
   ```
   Candidate Heap
   17.50%        24440M        java.lang.Long#valueOf()
   10.09%        14096M        jdk.internal.misc.Unsafe#allocateUninitializedArray()
   6.87%         9594M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   4.40%         6140M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
   ...
   ```
   
   ```
   main
   13.65%        11898M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   9.26%         8071M         org.apache.lucene.util.FixedBitSet#<init>()
   6.70%         5836M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
   6.60%         5751M         org.apache.lucene.util.ArrayUtil#growExact()
   5.21%         4541M         org.apache.lucene.facet.FacetsConfig#stringToPath()
   4.69%         4090M         org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
   ```
   
   ## FST doesn't play nicely with primitive types (I know, this is more or less a java issue)
   
   `24440M java.lang.Long#valueOf()` huge amount of allocations... This is obvious. The FST<T> implementation is generic over its output type and in my case T is `Long`. So for trivial `long` add and subtract, the implementation would allocate an object. Not only it is wasteful but from a perf perspective it'd be less than 1 CPU cycle v.s. calling allocator which is easily tens if not hundreds of cycles. 
   
   For this work, I forked the FST<T> class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results before and after.
   
   
   ```
   Before 
   
   PERCENT       HEAP SAMPLES  STACK                                                                                                                              
   25.97%        32791M        java.lang.Long#valueOf()
   7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()
   4.90%     
   ....
   
   After
   PERCENT       HEAP SAMPLES  STACK
   8.44%         7988M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
   7.17%         6788M         org.apache.lucene.util.FixedBitSet#<init>()
   6.22%         5886M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
   5.89%         5577M         org.apache.lucene.util.ArrayUtil#growExact()
   
   ```
   
   Bench diff
   
   ```
   Before
   
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                           Wildcard       11.61      (2.7%)        2.40      (0.6%)  -79.4% ( -80% -  -78%) 0.000
                             Fuzzy1       78.17      (0.7%)       27.16      (0.9%)  -65.3% ( -66% -  -64%) 0.000
                            Respell       29.09      (0.6%)       10.91      (0.8%)  -62.5% ( -63% -  -61%) 0.000
                             Fuzzy2       47.80      (0.6%)       18.50      (1.2%)  -61.3% ( -62% -  -59%) 0.000
                            Prefix3      765.08      (3.1%)      463.94      (0.9%)  -39.4% ( -42% -  -36%) 0.000
                  HighTermTitleSort       98.48      (2.0%)       90.62      (2.2%)   -8.0% ( -11% -   -3%) 0.000
              BrowseMonthTaxoFacets        3.89     (29.2%)        3.62      (0.9%)   -6.9% ( -28% -   32%) 0.293
                    LowSloppyPhrase       22.73      (6.5%)       22.35      (6.9%)   -1.7% ( -14% -   12%) 0.432
                            LowTerm      365.47      (3.6%)      359.62      (2.9%)   -1.6% (  -7% -    5%) 0.121
                           HighTerm      398.57      (5.1%)      393.16      (4.7%)   -1.4% ( -10% -    8%) 0.380
                    MedSloppyPhrase       10.63      (3.6%)       10.51      (3.7%)   -1.1% (  -8% -    6%) 0.339
                            MedTerm      422.73      (4.2%)      418.60      (4.0%)   -1.0% (  -8% -    7%) 0.451
               MedTermDayTaxoFacets       14.84      (2.6%)       14.71      (2.5%)   -0.8% (  -5% -    4%) 0.296
                   HighSloppyPhrase       12.41      (3.1%)       12.33      (3.1%)   -0.7% (  -6% -    5%) 0.487
               HighTermTitleBDVSort        6.88      (3.3%)        6.84      (3.5%)   -0.6% (  -7% -    6%) 0.599
                          LowPhrase       58.15      (2.9%)       57.85      (2.8%)   -0.5% (  -6% -    5%) 0.567
          BrowseDayOfYearSSDVFacets        3.24      (0.4%)        3.23      (0.5%)   -0.3% (  -1% -    0%) 0.042
                          MedPhrase       26.19      (3.1%)       26.11      (3.2%)   -0.3% (  -6% -    6%) 0.775
                       OrNotHighMed      185.23      (3.9%)      184.73      (3.3%)   -0.3% (  -7% -    7%) 0.813
             OrHighMedDayTaxoFacets        3.82      (3.3%)        3.81      (3.2%)   -0.3% (  -6% -    6%) 0.796
                       OrHighNotLow      194.98      (5.1%)      194.51      (4.6%)   -0.2% (  -9% -   10%) 0.875
                       OrHighNotMed      337.15      (4.4%)      336.53      (3.8%)   -0.2% (  -7% -    8%) 0.888
                             IntNRQ       67.60      (0.9%)       67.55      (1.0%)   -0.1% (  -1% -    1%) 0.783
                        MedSpanNear        9.85      (1.4%)        9.84      (2.1%)   -0.1% (  -3% -    3%) 0.906
                      OrNotHighHigh      205.12      (4.1%)      205.01      (3.9%)   -0.1% (  -7% -    8%) 0.967
           AndHighHighDayTaxoFacets        6.35      (1.5%)        6.34      (1.7%)   -0.0% (  -3% -    3%) 0.932
              BrowseMonthSSDVFacets        3.29      (0.8%)        3.29      (0.7%)   -0.0% (  -1% -    1%) 0.887
        BrowseRandomLabelSSDVFacets        2.30      (0.8%)        2.30      (1.0%)    0.0% (  -1% -    1%) 0.919
                        LowSpanNear       16.41      (2.6%)       16.42      (2.7%)    0.1% (  -5% -    5%) 0.931
                         HighPhrase       77.12      (3.0%)       77.20      (3.6%)    0.1% (  -6% -    6%) 0.923
            AndHighMedDayTaxoFacets       39.64      (1.2%)       39.68      (1.0%)    0.1% (  -2% -    2%) 0.742
        BrowseRandomLabelTaxoFacets        3.19      (1.6%)        3.19      (1.1%)    0.1% (  -2% -    2%) 0.728
               BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.5%)    0.3% (   0% -    1%) 0.157
                        AndHighHigh       27.08      (1.3%)       27.15      (3.0%)    0.3% (  -4% -    4%) 0.718
          BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.77      (0.5%)    0.3% (   0% -    1%) 0.072
              HighTermDayOfYearSort      224.01      (2.1%)      224.81      (2.1%)    0.4% (  -3% -    4%) 0.592
                       HighSpanNear        6.09      (2.7%)        6.11      (3.1%)    0.4% (  -5% -    6%) 0.683
               HighIntervalsOrdered        8.08      (3.3%)        8.11      (3.4%)    0.4% (  -6% -    7%) 0.705
                         TermDTSort      103.29      (4.4%)      103.83      (3.1%)    0.5% (  -6% -    8%) 0.669
                MedIntervalsOrdered       33.12      (4.4%)       33.29      (4.6%)    0.5% (  -8% -    9%) 0.702
                LowIntervalsOrdered       10.06      (3.9%)       10.12      (3.6%)    0.6% (  -6% -    8%) 0.609
                         AndHighMed       73.71      (2.2%)       74.18      (2.5%)    0.6% (  -3% -    5%) 0.394
                          OrHighMed       71.44      (2.7%)       71.98      (3.3%)    0.7% (  -5% -    6%) 0.429
               BrowseDateSSDVFacets        0.96      (4.8%)        0.97      (5.7%)    0.9% (  -9% -   11%) 0.601
                      OrHighNotHigh      308.82      (4.0%)      311.53      (3.7%)    0.9% (  -6% -    8%) 0.470
                          OrHighLow      404.69      (3.0%)      408.63      (3.5%)    1.0% (  -5% -    7%) 0.348
                         OrHighHigh       20.44      (4.7%)       20.73      (7.2%)    1.4% ( -10% -   13%) 0.469
                       OrNotHighLow      381.28      (1.8%)      388.18      (2.1%)    1.8% (  -2% -    5%) 0.004
                  HighTermMonthSort     2500.04      (2.2%)     2554.91      (4.3%)    2.2% (  -4% -    8%) 0.042
                         AndHighLow      668.12      (3.1%)      692.04      (3.9%)    3.6% (  -3% -   10%) 0.001
                           PKLookup      140.25      (2.0%)      168.53      (1.9%)   20.2% (  15% -   24%) 0.000
   
   After
   
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                           Wildcard       54.96      (2.6%)       10.43      (0.5%)  -81.0% ( -82% -  -80%) 0.000
                            Respell       45.54      (1.0%)       16.74      (0.7%)  -63.2% ( -64% -  -62%) 0.000
                             Fuzzy1       46.41      (1.2%)       17.26      (1.0%)  -62.8% ( -64% -  -61%) 0.000
                            Prefix3      121.65      (2.4%)       55.57      (0.9%)  -54.3% ( -56% -  -52%) 0.000
                             Fuzzy2       32.33      (1.2%)       15.79      (1.1%)  -51.2% ( -52% -  -49%) 0.000
                  HighTermTitleSort       95.24      (2.1%)       87.04      (1.9%)   -8.6% ( -12% -   -4%) 0.000
        BrowseRandomLabelSSDVFacets        2.37      (7.1%)        2.33      (4.8%)   -1.7% ( -12% -   10%) 0.374
              BrowseMonthSSDVFacets        3.34      (7.3%)        3.29      (0.6%)   -1.5% (  -8% -    6%) 0.362
                         TermDTSort      120.57      (2.3%)      119.02      (3.4%)   -1.3% (  -6% -    4%) 0.163
                         OrHighHigh       19.13      (5.6%)       18.92      (2.7%)   -1.1% (  -8% -    7%) 0.430
                        AndHighHigh       22.04      (5.1%)       21.87      (3.0%)   -0.8% (  -8% -    7%) 0.555
                         AndHighMed       55.06      (3.0%)       54.79      (2.1%)   -0.5% (  -5% -    4%) 0.546
                       HighSpanNear        3.29      (1.6%)        3.28      (1.7%)   -0.5% (  -3% -    2%) 0.346
               HighIntervalsOrdered        0.65      (1.8%)        0.65      (2.0%)   -0.5% (  -4% -    3%) 0.433
              HighTermDayOfYearSort      282.86      (2.0%)      281.57      (2.6%)   -0.5% (  -4% -    4%) 0.533
                MedIntervalsOrdered       16.36      (1.5%)       16.29      (1.5%)   -0.4% (  -3% -    2%) 0.369
                          OrHighMed       68.27      (2.9%)       67.99      (1.8%)   -0.4% (  -5% -    4%) 0.598
                        MedSpanNear        3.22      (1.0%)        3.21      (1.4%)   -0.4% (  -2% -    2%) 0.317
                   HighSloppyPhrase        9.59      (2.5%)        9.57      (2.6%)   -0.3% (  -5% -    4%) 0.733
              BrowseMonthTaxoFacets        3.64      (2.4%)        3.63      (1.8%)   -0.2% (  -4% -    4%) 0.756
                LowIntervalsOrdered       14.66      (0.9%)       14.63      (1.5%)   -0.2% (  -2% -    2%) 0.633
               MedTermDayTaxoFacets       15.56      (2.7%)       15.54      (3.9%)   -0.2% (  -6% -    6%) 0.879
            AndHighMedDayTaxoFacets       18.70      (1.4%)       18.67      (3.7%)   -0.2% (  -5% -    5%) 0.864
                        LowSpanNear        4.39      (1.1%)        4.38      (1.4%)   -0.1% (  -2% -    2%) 0.728
             OrHighMedDayTaxoFacets        5.38      (3.5%)        5.38      (5.4%)   -0.1% (  -8% -    9%) 0.945
           AndHighHighDayTaxoFacets        7.06      (1.6%)        7.06      (3.0%)   -0.1% (  -4% -    4%) 0.924
                    LowSloppyPhrase        7.16      (1.4%)        7.15      (1.6%)   -0.1% (  -2% -    2%) 0.891
                    MedSloppyPhrase      128.54      (1.9%)      128.56      (2.2%)    0.0% (  -4% -    4%) 0.979
                            LowTerm      417.80      (3.3%)      418.01      (2.7%)    0.1% (  -5% -    6%) 0.958
                          LowPhrase      125.59      (4.0%)      125.77      (3.1%)    0.1% (  -6% -    7%) 0.900
                          OrHighLow      313.22      (2.1%)      313.72      (2.2%)    0.2% (  -4% -    4%) 0.817
               BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.7%)    0.2% (  -1% -    1%) 0.470
          BrowseDayOfYearTaxoFacets        3.76      (0.7%)        3.76      (0.7%)    0.2% (  -1% -    1%) 0.457
                            MedTerm      384.57      (4.6%)      385.44      (3.6%)    0.2% (  -7% -    8%) 0.863
                      OrHighNotHigh      255.07      (4.3%)      256.05      (4.3%)    0.4% (  -7% -    9%) 0.778
                          MedPhrase       11.17      (3.0%)       11.21      (2.6%)    0.4% (  -5% -    6%) 0.658
                           HighTerm      361.26      (5.1%)      362.86      (4.2%)    0.4% (  -8% -   10%) 0.764
        BrowseRandomLabelTaxoFacets        3.19      (1.5%)        3.20      (0.6%)    0.5% (  -1% -    2%) 0.203
                      OrNotHighHigh      205.38      (4.0%)      206.35      (4.0%)    0.5% (  -7% -    8%) 0.712
                       OrNotHighLow      317.96      (1.7%)      319.48      (2.1%)    0.5% (  -3% -    4%) 0.428
                         HighPhrase       47.91      (3.8%)       48.15      (3.3%)    0.5% (  -6% -    7%) 0.661
               BrowseDateSSDVFacets        0.97      (6.9%)        0.98      (6.7%)    0.5% ( -12% -   15%) 0.801
                       OrHighNotLow      185.96      (4.9%)      187.04      (5.0%)    0.6% (  -8% -   11%) 0.710
          BrowseDayOfYearSSDVFacets        3.21      (2.1%)        3.23      (0.9%)    0.6% (  -2% -    3%) 0.225
               HighTermTitleBDVSort        5.83      (3.7%)        5.87      (4.0%)    0.7% (  -6% -    8%) 0.584
                       OrNotHighMed      516.84      (2.5%)      520.76      (2.5%)    0.8% (  -4% -    5%) 0.334
                             IntNRQ       29.24      (3.0%)       29.50      (4.1%)    0.9% (  -6% -    8%) 0.425
                       OrHighNotMed      268.45      (4.4%)      270.92      (4.2%)    0.9% (  -7% -    9%) 0.501
                  HighTermMonthSort     2498.46      (4.8%)     2590.43      (3.7%)    3.7% (  -4% -   12%) 0.007
                         AndHighLow      747.94      (2.1%)      775.60      (4.0%)    3.7% (  -2% -   10%) 0.000
                           PKLookup      141.68      (2.0%)      177.85      (1.5%)   25.5% (  21% -   29%) 0.000
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org