You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Tony-X (via GitHub)" <gi...@apache.org> on 2023/10/16 22:11:07 UTC
[PR] Random access term dictionary [lucene]
Tony-X opened a new pull request, #12688:
URL: https://github.com/apache/lucene/pull/12688
### Description
Related issue https://github.com/apache/lucene/issues/12513
Opening this PR early to avoid massive diffs in one-shot
- [x] Encode (term type, local ord) in FST
TODO:
- [ ] Implement bit-packing and unpacking for each term type
- [ ] Implement the PostingsFormat
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773113866
Thanks @bruno-roustant ! If you're okay to share it feel free to share it here.
I'm in the process of baking my own specific implementation (which internally uses a single long as bit buffer), but I might absorb some interesting ideas from your impl.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857417012
Here is the even more interesting stuff. After all those allocation optimizations. I also implemented the on-paper more "efficient" algorithm to intersect FST and FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance with some skipping). FST in some cases have direct addressing which is exploited, too. As a side note -- it also uncovered a bug for the NFA impl which is tracked here https://github.com/apache/lucene/issues/12906.
But that change is not moving the needle at all for `WildCard` and `Prefix3` search tasks.
```
Before
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 62.85 (1.8%) 12.66 (0.6%) -79.9% ( -80% - -78%) 0.000
Fuzzy2 55.12 (1.0%) 18.92 (0.8%) -65.7% ( -66% - -64%) 0.000
Fuzzy1 61.20 (0.8%) 22.55 (0.8%) -63.2% ( -64% - -62%) 0.000
Respell 31.11 (0.8%) 11.95 (0.6%) -61.6% ( -62% - -60%) 0.000
Prefix3 135.69 (2.0%) 65.06 (0.7%) -52.1% ( -53% - -50%) 0.000
HighTermTitleSort 119.58 (0.9%) 111.03 (1.7%) -7.2% ( -9% - -4%) 0.000
IntNRQ 22.25 (1.1%) 21.87 (1.5%) -1.7% ( -4% - 0%) 0.000
HighPhrase 25.82 (3.6%) 25.55 (3.2%) -1.1% ( -7% - 5%) 0.318
MedPhrase 7.41 (2.4%) 7.35 (2.2%) -0.8% ( -5% - 3%) 0.259
LowSpanNear 8.81 (1.9%) 8.74 (2.1%) -0.8% ( -4% - 3%) 0.202
OrHighMedDayTaxoFacets 3.86 (5.8%) 3.83 (4.8%) -0.8% ( -10% - 10%) 0.636
TermDTSort 100.75 (2.9%) 99.98 (2.0%) -0.8% ( -5% - 4%) 0.336
HighIntervalsOrdered 6.07 (2.1%) 6.03 (2.4%) -0.7% ( -5% - 3%) 0.342
MedIntervalsOrdered 45.89 (2.0%) 45.61 (2.4%) -0.6% ( -4% - 3%) 0.389
HighSpanNear 10.73 (1.0%) 10.66 (1.5%) -0.6% ( -3% - 2%) 0.165
HighTermDayOfYearSort 206.09 (1.8%) 204.93 (1.9%) -0.6% ( -4% - 3%) 0.338
LowIntervalsOrdered 8.39 (2.3%) 8.37 (2.5%) -0.3% ( -5% - 4%) 0.654
MedSpanNear 66.00 (1.3%) 65.81 (1.9%) -0.3% ( -3% - 2%) 0.574
MedTerm 322.61 (4.7%) 321.89 (4.5%) -0.2% ( -9% - 9%) 0.878
AndHighMedDayTaxoFacets 22.62 (1.0%) 22.58 (1.2%) -0.2% ( -2% - 2%) 0.617
LowPhrase 48.52 (1.3%) 48.46 (1.4%) -0.1% ( -2% - 2%) 0.745
LowTerm 403.54 (2.9%) 403.22 (2.4%) -0.1% ( -5% - 5%) 0.923
BrowseRandomLabelTaxoFacets 3.20 (0.7%) 3.20 (0.9%) -0.0% ( -1% - 1%) 0.905
AndHighHighDayTaxoFacets 8.06 (1.3%) 8.06 (1.6%) 0.0% ( -2% - 2%) 0.962
BrowseDayOfYearTaxoFacets 3.76 (0.6%) 3.76 (0.6%) 0.0% ( -1% - 1%) 0.859
BrowseMonthTaxoFacets 3.62 (1.0%) 3.62 (1.0%) 0.1% ( -1% - 2%) 0.866
OrHighNotHigh 156.16 (6.0%) 156.26 (5.9%) 0.1% ( -11% - 12%) 0.972
BrowseDateTaxoFacets 3.73 (0.6%) 3.73 (0.6%) 0.1% ( -1% - 1%) 0.722
OrNotHighHigh 144.55 (5.1%) 144.68 (4.7%) 0.1% ( -9% - 10%) 0.957
MedTermDayTaxoFacets 17.57 (2.8%) 17.59 (2.8%) 0.2% ( -5% - 5%) 0.863
BrowseDayOfYearSSDVFacets 3.22 (0.9%) 3.23 (0.7%) 0.2% ( -1% - 1%) 0.424
HighTerm 401.13 (5.4%) 402.30 (5.7%) 0.3% ( -10% - 12%) 0.868
BrowseMonthSSDVFacets 3.27 (0.7%) 3.28 (1.0%) 0.3% ( -1% - 2%) 0.202
AndHighHigh 20.87 (2.8%) 20.94 (2.5%) 0.4% ( -4% - 5%) 0.670
HighSloppyPhrase 6.58 (3.3%) 6.61 (3.4%) 0.4% ( -6% - 7%) 0.727
AndHighMed 89.34 (1.8%) 89.76 (1.4%) 0.5% ( -2% - 3%) 0.355
BrowseDateSSDVFacets 0.95 (2.8%) 0.95 (4.0%) 0.5% ( -6% - 7%) 0.656
OrNotHighLow 420.17 (2.2%) 422.34 (2.2%) 0.5% ( -3% - 4%) 0.452
LowSloppyPhrase 2.89 (2.4%) 2.91 (1.9%) 0.6% ( -3% - 4%) 0.369
OrNotHighMed 219.50 (4.5%) 221.02 (4.1%) 0.7% ( -7% - 9%) 0.611
MedSloppyPhrase 10.44 (2.2%) 10.52 (1.4%) 0.7% ( -2% - 4%) 0.222
OrHighNotLow 288.48 (5.4%) 290.70 (5.7%) 0.8% ( -9% - 12%) 0.663
OrHighMed 53.25 (3.6%) 53.66 (3.5%) 0.8% ( -6% - 8%) 0.488
HighTermTitleBDVSort 2.77 (7.2%) 2.79 (7.0%) 0.9% ( -12% - 16%) 0.699
OrHighNotMed 270.38 (5.7%) 272.88 (5.4%) 0.9% ( -9% - 12%) 0.601
OrHighHigh 20.86 (5.1%) 21.08 (5.2%) 1.1% ( -8% - 12%) 0.519
OrHighLow 220.40 (4.2%) 223.52 (5.6%) 1.4% ( -8% - 11%) 0.367
BrowseRandomLabelSSDVFacets 2.32 (4.3%) 2.38 (7.6%) 2.3% ( -9% - 14%) 0.240
AndHighLow 395.82 (2.9%) 405.36 (2.8%) 2.4% ( -3% - 8%) 0.008
HighTermMonthSort 2375.71 (3.7%) 2555.72 (5.0%) 7.6% ( -1% - 16%) 0.000
PKLookup 140.60 (1.8%) 178.02 (1.3%) 26.6% ( 23% - 30%) 0.000
After
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 37.77 (2.7%) 5.70 (1.3%) -84.9% ( -86% - -83%) 0.000
Prefix3 52.27 (2.7%) 22.71 (1.8%) -56.6% ( -59% - -53%) 0.000
Fuzzy1 59.96 (1.6%) 54.80 (2.3%) -8.6% ( -12% - -4%) 0.000
HighTermTitleSort 106.17 (2.1%) 101.30 (1.5%) -4.6% ( -7% - -1%) 0.000
Fuzzy2 33.40 (1.3%) 32.14 (1.6%) -3.8% ( -6% - 0%) 0.000
MedTerm 273.84 (5.4%) 265.92 (9.7%) -2.9% ( -17% - 12%) 0.245
HighTerm 349.66 (5.2%) 341.63 (8.9%) -2.3% ( -15% - 12%) 0.320
LowTerm 356.23 (3.1%) 350.27 (4.3%) -1.7% ( -8% - 5%) 0.156
LowSloppyPhrase 4.47 (2.1%) 4.44 (4.6%) -0.8% ( -7% - 6%) 0.492
HighSpanNear 8.12 (2.1%) 8.06 (2.5%) -0.7% ( -5% - 4%) 0.331
MedSloppyPhrase 31.05 (3.6%) 30.83 (4.1%) -0.7% ( -8% - 7%) 0.559
MedIntervalsOrdered 3.88 (3.3%) 3.86 (3.3%) -0.7% ( -7% - 6%) 0.519
MedSpanNear 8.94 (1.2%) 8.88 (1.6%) -0.7% ( -3% - 2%) 0.126
LowIntervalsOrdered 7.40 (3.3%) 7.35 (3.4%) -0.7% ( -7% - 6%) 0.537
LowSpanNear 29.33 (2.0%) 29.15 (2.2%) -0.6% ( -4% - 3%) 0.374
HighSloppyPhrase 6.68 (3.6%) 6.64 (3.5%) -0.5% ( -7% - 6%) 0.624
MedTermDayTaxoFacets 9.14 (3.0%) 9.11 (8.2%) -0.3% ( -11% - 11%) 0.861
HighPhrase 115.62 (3.9%) 115.24 (4.0%) -0.3% ( -7% - 7%) 0.798 [162/1927]
AndHighHigh 13.95 (4.2%) 13.92 (4.5%) -0.3% ( -8% - 8%) 0.847
BrowseMonthSSDVFacets 3.30 (0.8%) 3.29 (1.0%) -0.3% ( -2% - 1%) 0.377
AndHighMed 85.40 (2.0%) 85.18 (2.1%) -0.2% ( -4% - 3%) 0.695
IntNRQ 16.65 (4.1%) 16.63 (3.8%) -0.1% ( -7% - 8%) 0.914
BrowseRandomLabelSSDVFacets 2.30 (0.9%) 2.30 (1.1%) -0.1% ( -2% - 1%) 0.754
OrHighHigh 24.99 (6.1%) 24.97 (5.3%) -0.1% ( -10% - 12%) 0.957
AndHighHighDayTaxoFacets 2.29 (2.8%) 2.29 (2.4%) -0.0% ( -5% - 5%) 0.977
AndHighMedDayTaxoFacets 40.17 (1.4%) 40.20 (1.4%) 0.1% ( -2% - 2%) 0.872
OrHighMedDayTaxoFacets 3.15 (3.9%) 3.15 (3.4%) 0.1% ( -6% - 7%) 0.946
LowPhrase 30.23 (2.5%) 30.26 (2.4%) 0.1% ( -4% - 5%) 0.911
OrNotHighLow 201.78 (2.9%) 202.03 (3.0%) 0.1% ( -5% - 6%) 0.896
BrowseRandomLabelTaxoFacets 3.20 (3.2%) 3.21 (4.1%) 0.1% ( -6% - 7%) 0.899
HighIntervalsOrdered 0.42 (1.9%) 0.42 (1.6%) 0.2% ( -3% - 3%) 0.699
OrNotHighMed 235.49 (5.5%) 236.01 (4.7%) 0.2% ( -9% - 11%) 0.892
BrowseMonthTaxoFacets 3.62 (1.0%) 3.63 (1.0%) 0.2% ( -1% - 2%) 0.477
OrNotHighHigh 329.77 (4.9%) 330.79 (5.5%) 0.3% ( -9% - 11%) 0.851
MedPhrase 35.79 (3.4%) 35.90 (3.4%) 0.3% ( -6% - 7%) 0.771
TermDTSort 112.10 (3.2%) 112.45 (3.4%) 0.3% ( -6% - 7%) 0.763
BrowseDateSSDVFacets 0.97 (7.1%) 0.98 (10.0%) 0.4% ( -15% - 18%) 0.897
BrowseDayOfYearSSDVFacets 3.21 (2.2%) 3.22 (1.6%) 0.4% ( -3% - 4%) 0.525
HighTermDayOfYearSort 235.24 (2.1%) 236.16 (1.6%) 0.4% ( -3% - 4%) 0.512
OrHighMed 70.60 (3.3%) 70.99 (2.7%) 0.5% ( -5% - 6%) 0.571
AndHighLow 370.31 (3.2%) 372.60 (3.4%) 0.6% ( -5% - 7%) 0.559
HighTermTitleBDVSort 5.53 (4.1%) 5.56 (4.5%) 0.6% ( -7% - 9%) 0.648
OrHighNotLow 263.18 (5.6%) 264.95 (6.2%) 0.7% ( -10% - 13%) 0.717
OrHighNotMed 222.41 (5.8%) 224.06 (5.8%) 0.7% ( -10% - 13%) 0.688
OrHighNotHigh 233.04 (5.5%) 234.89 (5.8%) 0.8% ( -9% - 12%) 0.657
OrHighLow 463.17 (3.0%) 466.91 (3.1%) 0.8% ( -5% - 7%) 0.403
BrowseDayOfYearTaxoFacets 3.77 (0.6%) 3.84 (8.9%) 1.9% ( -7% - 11%) 0.342
BrowseDateTaxoFacets 3.74 (0.6%) 3.81 (8.8%) 1.9% ( -7% - 11%) 0.332
HighTermMonthSort 2350.73 (4.0%) 2477.32 (4.5%) 5.4% ( -2% - 14%) 0.000
Respell 30.81 (1.5%) 34.87 (1.7%) 13.2% ( 9% - 16%) 0.000
PKLookup 141.03 (1.9%) 177.54 (2.0%) 25.9% ( 21% - 30%) 0.000
```
I tried to modify the bench task file and only run `WildCard` to understand where the time is spent.
### My version
![image](https://github.com/apache/lucene/assets/7917591/cf0589da-befd-4250-a4fb-6d2e4f65c0cf)
### mainline
![image](https://github.com/apache/lucene/assets/7917591/dd8b820d-4471-429f-909d-55b1d5ff35d9)
So we can see that the most time is spent in actually reading out the FST arcs and FSA transitions... My intuitive explanation for why this is slower than the blocktree is that it has worse locality in its data access pattern. (@mikemccand maybe you can shed some light) Here are some relevant factors:
* The FST is larger as it contains all terms. So there are more Arcs to visit. Blocktree (main) use the FST to index prefixes.
* When binary-searching or directly address Arc/Transitions the target is somewhat random.
* The FST bytes are read backwards. (probably less of an issue if we read sequentially on modern HW)
* Blocktree at a given node reads bytes sequentially and terms are sorted, too.
Just out of curiosity I altered my code to load the FST on-heap to compare with the default off-heap option. It did not help much with `Wildcard` but PKLookup got substantially faster!
The PKLookup task is a great proxy to FST performance, as both versions of the code visits the exact same number of Arcs.
```
Off heap
Wildcard 47.56 (1.7%) 10.13 (0.4%) -78.7% ( -79% - -77%) 0.000
PKLookup 136.03 (2.4%) 147.93 (2.3%) 8.8% ( 3% - 13%) 0.000
on heap
Wildcard 37.11 (1.5%) 8.35 (0.3%) -77.5% ( -78% - -76%) 0.000
PKLookup 136.04 (3.3%) 269.60 (9.0%) 98.2% ( 83% - 114%) 0.000
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391564228
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if (ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if (ord > MAX_ORD) {
+ throw new IllegalArgumentException(
+ "Input ord is too large for TermType: "
Review Comment:
I see. That's a good point. I'll keep that in mind when implementing the main FieldsConsumer!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391027915
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if (ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if (ord > MAX_ORD) {
+ throw new IllegalArgumentException(
+ "Input ord is too large for TermType: "
Review Comment:
Well, maybe something like `Can only index XXX unique terms, but saw YYY terms for field ZZZ` or so? (If that's what this error message really means to the user).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
Review Comment:
Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUT` value for the FST. so we peel off 4 bits in total.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381984403
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+ private final TermStateCodecComponent[] components;
+ private final int metadataBytesLength;
+
+ private static int getMetadataLength(TermStateCodecComponent component) {
+ // 1 byte for bitWidth; optionally 8 byte more for the reference value
+ return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+ }
+
+ public TermStateCodecImpl(TermStateCodecComponent[] components) {
+ assert components.length > 0;
+
+ this.components = components;
+ int metadataBytesLength = 0;
+ for (var component : components) {
+ metadataBytesLength += getMetadataLength(component);
+ }
+ this.metadataBytesLength = metadataBytesLength;
+ }
+
+ @Override
+ public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+ Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+ byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+ // Encode inputs via the bitpacker
+ for (var termState : inputs) {
+ encodeOne(bitPacker, termState, metadataPerComponent);
+ }
+
+ return metadataBytes;
+ }
+
+ private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+ Metadata[] metadataPerComponent = new Metadata[components.length];
+ for (int i = 0; i < components.length; i++) {
+ var component = components[i];
+ byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+ long referenceValue =
+ component.isMonotonicallyIncreasing() ? component.getTargetValue(inputs[0]) : 0L;
+ metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+ }
+ return metadataPerComponent;
+ }
+
+ private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+ byte[] metadataBytes = new byte[this.metadataBytesLength];
+ ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+ for (int i = 0; i < components.length; i++) {
+ var metadata = metadataPerComponent[i];
+ dataOut.writeByte(metadata.bitWidth);
+ if (components[i].isMonotonicallyIncreasing()) {
+ dataOut.writeLong(metadata.referenceValue);
+ }
+ }
+ return metadataBytes;
+ }
+
+ private void encodeOne(
Review Comment:
Yes, since the common access pattern is to grab all interesting states for one term as oppose to "I want to get all posting start FP for many terms"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438004582
##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static final long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler;
+
+ TermsIndexBuilder() throws IOException {
+ fstCompiler =
Review Comment:
Just FYI, you can plug the `indexOutput` directly to the FSTCompiler (passing from RandomAccessTermsDictWriter), and make the FST writing entirely off-heap (apart from the heap, it also eliminates the time taken to write to the on-heap DataOutput). I have some example PRs at:
- https://github.com/apache/lucene/pull/12985
- https://github.com/apache/lucene/pull/12980
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382078005
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########
Review Comment:
> I'm in the process of building the real compact bit packer.
+1. I'm assuming that will be added to this PR?
> The goal here is to pack a sequence of values that have different bitwidths... We can't use VLong either since we aim to write fixed size record, so that we can do random access
Hmm.. I'll have to look deeper at this. The reason I ask is because I did a similar bit packing w/ "random access" when serializing the [ShapeDocValues binary tree](https://github.com/apache/lucene/blob/d6836d3d0e5d33a98b35c0885b9787f46c4be47e/lucene/core/src/java/org/apache/lucene/document/ShapeDocValues.java#L462C26-L462C26) and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391032959
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {
Review Comment:
Hmm maybe don't name it `Codec` since it makes it seem like it could be a normal Lucene `Codec`? But this is much lower level ... maybe `TermStateEncoder`?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
Review Comment:
Do we really need the separable interface and impl yet? Are there more than one impl we are exploring?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/bitpacking/FixedSizeByteArrayBitPacker.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking;
+
+/**
+ * A {@link BitPacker} implementation that requires user to know the size of the resulting byte
+ * array upfront, in order to avoid allocation and copying for dynamically growing the array.
+ */
+public final class FixedSizeByteArrayBitPacker extends BitPackerImplBase {
+ private final byte[] bytes;
+ private int numBytesUsed = 0;
Review Comment:
No need for `= 0` -- it's already java's default.
##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TestTermStateCodecImpl.java:
##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPerBytePacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpackerImpl;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.FixedSizeByteArrayBitPacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.util.BytesRef;
+
+public class TestTermStateCodecImpl extends LuceneTestCase {
+
+ public void testEncodeDecode() {
+ TermStateCodecImpl codec =
+ new TermStateCodecImpl(
+ new TermStateCodecComponent[] {
+ TermStateCodecComponent.DocFreq.INSTANCE, TermStateCodecComponent.DocStartFP.INSTANCE,
+ });
+
+ ArrayList<IntBlockTermState> termStates = new ArrayList<>();
+ long maxDocFreqSeen = -1;
+ long docStartFPBase = random().nextLong(Long.MAX_VALUE >> 1);
+ long maxDocStartFPDeltaSeen = -1;
+ for (int i = 0; i < random().nextInt(2, 256); i++) {
+ var termState = new IntBlockTermState();
+ termState.docFreq = random().nextInt(1, 1 << random().nextInt(1, 31));
+ if (i == 0) {
+ termState.docStartFP = docStartFPBase;
+ } else {
+ termState.docStartFP = termStates.get(i - 1).docStartFP + random().nextLong(1024);
+ maxDocStartFPDeltaSeen =
+ Math.max(maxDocStartFPDeltaSeen, termState.docStartFP - docStartFPBase);
+ }
+ maxDocFreqSeen = Math.max(maxDocFreqSeen, termState.docFreq);
+ termStates.add(termState);
+ }
+
+ IntBlockTermState[] termStatesArray = termStates.toArray(IntBlockTermState[]::new);
+
+ BitPerBytePacker bitPerBytePacker = new BitPerBytePacker();
+ byte[] metadata = codec.encodeBlock(termStatesArray, bitPerBytePacker);
+
+ // For the metadata, we expect
+ // 0: DocFreq.bitWidth,
+ // 1: DocStartFP.bitWidth,
+ // [2-10]: DocStartFP.referenceValue;
+ int expectedDocFreqBitWidth = 64 - Long.numberOfLeadingZeros(maxDocFreqSeen);
+ int expectedDocStartFPBitWidth = 64 - Long.numberOfLeadingZeros(maxDocStartFPDeltaSeen);
+ assertEquals(10, metadata.length);
+ assertEquals(expectedDocFreqBitWidth, metadata[0]);
+ assertEquals(expectedDocStartFPBitWidth, metadata[1]);
+ ByteArrayDataInput byteArrayDataInput = new ByteArrayDataInput(metadata, 2, 8);
+ assertEquals(docStartFPBase, byteArrayDataInput.readLong());
+
+ // Assert with real bit-packer we get the same bytes
+ FixedSizeByteArrayBitPacker fixedSizeByteArrayBitPacker =
+ new FixedSizeByteArrayBitPacker(bitPerBytePacker.getCompactBytes().length);
+ codec.encodeBlock(termStatesArray, fixedSizeByteArrayBitPacker);
+ assertArrayEquals(bitPerBytePacker.getCompactBytes(), fixedSizeByteArrayBitPacker.getBytes());
+
+ // Assert that each term state is the same after the encode-decode roundtrip.
+ BytesRef metadataBytes = new BytesRef(metadata);
+ BytesRef dataBytes = new BytesRef(bitPerBytePacker.getBytes());
+ assertBlockRoundTrip(termStatesArray, codec, metadataBytes, dataBytes, bitPerBytePacker);
+
+ // With real compact bits instead of bit-per-byte
+ dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes());
+ assertBlockRoundTrip(
+ termStatesArray, codec, metadataBytes, dataBytes, BitUnpackerImpl.INSTANCE);
+
+ // Also test decoding that doesn't begin at the start of the block.
+ int pos = random().nextInt(termStatesArray.length);
+ int startBitIndex = pos > 0 ? random().nextInt(pos) : 0;
+ int recordSize = expectedDocFreqBitWidth + expectedDocStartFPBitWidth;
+ // With bit-per-byte bytes
+ dataBytes =
+ new BytesRef(bitPerBytePacker.getBytes(), pos * recordSize - startBitIndex, recordSize);
+ assertDecodeAt(
+ codec, metadataBytes, dataBytes, bitPerBytePacker, startBitIndex, termStatesArray[pos]);
+
+ // With compact bytes
+ int startByteIndex = pos * recordSize / 8;
+ int endByteIndex = (pos + 1) * recordSize / 8;
+ int length = endByteIndex - startByteIndex + ((pos + 1) * recordSize % 8 == 0 ? 0 : 1);
+ dataBytes = new BytesRef(bitPerBytePacker.getCompactBytes(), startByteIndex, length);
+ assertDecodeAt(
+ codec,
+ metadataBytes,
+ dataBytes,
+ BitUnpackerImpl.INSTANCE,
+ (pos * recordSize) % 8,
+ termStatesArray[pos]);
+ }
+
+ private static void assertDecodeAt(
+ TermStateCodecImpl codec,
+ BytesRef metadataBytes,
+ BytesRef dataBytes,
+ BitUnpacker bitUnpacker,
+ int startBitIndex,
+ IntBlockTermState termState) {
+ IntBlockTermState decoded =
+ codec.decodeAt(metadataBytes, dataBytes, bitUnpacker, startBitIndex);
+ assertEquals(termState.docFreq, decoded.docFreq);
+ assertEquals(termState.docStartFP, decoded.docStartFP);
+ }
+
+ private static void assertBlockRoundTrip(
+ IntBlockTermState[] termStatesArray,
+ TermStateCodecImpl codec,
+ BytesRef metadataBytes,
+ BytesRef dataBytes,
+ BitUnpacker bitUnpacker) {
+ for (int i = 0; i < termStatesArray.length; i++) {
+ IntBlockTermState decoded = codec.decodeWithinBlock(metadataBytes, dataBytes, bitUnpacker, i);
+ assertEquals(termStatesArray[i].docFreq, decoded.docFreq);
+ assertEquals(termStatesArray[i].docStartFP, decoded.docStartFP);
+ }
+ }
+
+ public void testGetCodec() {
+ for (IndexOptions indexOptions : IndexOptions.values()) {
+ if (indexOptions == IndexOptions.NONE) {
+ continue;
+ }
+ for (int i = 0; i < 8; i++) {
+ if ((i & 0b011) == 0b011) {
+ continue;
+ }
+ if ((i & 0b100) == 0b100
+ && indexOptions.ordinal() < IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+ continue;
+ }
+ TermType termType = TermType.fromId(i);
+ var expected = getExpectedCodec(termType, indexOptions);
+ var got = TermStateCodecImpl.getCodec(termType, indexOptions);
+ assertEquals(expected, got);
+ }
+ }
+ }
+
+ // Enumerate the expected Codec we get for (TermType, IndexOptions) pairs.
+ static TermStateCodecImpl getExpectedCodec(TermType termType, IndexOptions indexOptions) {
+ ArrayList<TermStateCodecComponent> components = new ArrayList<>();
+ // Wish I can code this better in java...
+ switch (termType.getId()) {
+ // Not singleton doc; No skip data; No last position block offset
+ case 0b000 -> {
+ assert !termType.hasLastPositionBlockOffset()
+ && !termType.hasSkipData()
+ && !termType.hasSingletonDoc();
+ components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+ components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+ }
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+ components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+ components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+ }
+ if (indexOptions.ordinal()
+ >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+ components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+ }
+ }
+ // Singleton doc; No skip data; No last position block offset
+ case 0b001 -> {
+ assert !termType.hasLastPositionBlockOffset()
+ && !termType.hasSkipData()
+ && termType.hasSingletonDoc();
+ components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+ // If field needs frequency, we need totalTermsFreq.
+ // Since there is only 1 doc, totalTermsFreq == docFreq.
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+ components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+ }
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+ components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+ }
+ if (indexOptions.ordinal()
+ >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+ components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+ }
+ }
+
+ // Not Singleton doc; Has skip data; No last position block offset
+ case 0b010 -> {
+ assert !termType.hasLastPositionBlockOffset()
+ && termType.hasSkipData()
+ && !termType.hasSingletonDoc();
+ components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+ components.add(TermStateCodecComponent.SkipOffset.INSTANCE);
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS.ordinal()) {
+ components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+ }
+ if (indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal()) {
+ components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+ components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+ }
+ if (indexOptions.ordinal()
+ >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+ components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+ }
+ }
+ // Singleton doc but has skip data; Invalid state.
+ case 0b011, 0b111 -> {
+ assert termType.hasSkipData() && termType.hasSingletonDoc();
+ throw new IllegalStateException(
+ "Unreachable. A term has skip data but also only has one doc!? Must be a bug");
+ }
+ // Not singleton doc; No skip data; Has last position block offset;
+ case 0b100 -> {
+ assert termType.hasLastPositionBlockOffset()
+ && !termType.hasSkipData()
+ && !termType.hasSingletonDoc();
+ assert indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+ components.add(TermStateCodecComponent.DocStartFP.INSTANCE);
+ components.add(TermStateCodecComponent.DocFreq.INSTANCE);
+ components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+ components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+ components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+ if (indexOptions.ordinal()
+ >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+ components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+ }
+ }
+ // Singleton doc; No skip data; Has last position block offset;
+ case 0b101 -> {
+ assert termType.hasLastPositionBlockOffset()
+ && !termType.hasSkipData()
+ && termType.hasSingletonDoc();
+ assert indexOptions.ordinal() >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.ordinal();
+ components.add(TermStateCodecComponent.SingletonDocId.INSTANCE);
+ components.add(TermStateCodecComponent.TotalTermFreq.INSTANCE);
+ components.add(TermStateCodecComponent.PositionStartFP.INSTANCE);
+ components.add(TermStateCodecComponent.LastPositionBlockOffset.INSTANCE);
+ if (indexOptions.ordinal()
+ >= IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.ordinal()) {
+ components.add(TermStateCodecComponent.PayloadStartFP.INSTANCE);
+ }
+ }
+ // Not singleton doc; Has skip data; Has last position block offset;
+ case 0b110 -> {
Review Comment:
Oooh I like the binary-valued case switches!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1860623376
Theres are very interesting results @Tony-X! I'll try to give deeper response soon, but one idea that jumped out about `Wildcard` is that BlockTree somewhere takes advantage of `commonSuffixBytes` or so? This is a `BytesRef` that is non-empty when all strings matched by the `Automaton` share some common suffix, as would happen with a wildcard query like `ab*cd` (`cd` would be the common suffix).
Maybe block tree is able to use this opto more effectively than the "all terms in FST" approach? But I think you could implement such an opto too, maybe: just find the one node (or, maybe more than one) whose suffix is the common suffix, and fast-check somehow?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381990655
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.
+ */
+final class TermType {
+ private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+ private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+ private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+ public static final int NUM_TOTAL_TYPES = 8;
Review Comment:
Hmm, 8 because even for the most demanding indexing options which asks for everything: doc, freq, position, payload and offset, there are three optional states singleton docid, skip offset and the last VINT position block offset.
For lighter indexing options, it is possible that the terms produced will have less than 8 types. For example, if position is not needed then we know we never have last VINT position block offset. So there can only be up to 4 types.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
Review Comment:
Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUT` value for the FST.
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if (ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if (ord > MAX_ORD) {
+ throw new IllegalArgumentException(
+ "Input ord is too large for TermType: "
Review Comment:
+1
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1770448647
I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM).
Also, with the upcoming ability to [cleanly limit how much RAM the `FSTCompiler` is allowed to use to reduce the size of the FST](https://github.com/apache/lucene/pull/12633), this approach becomes more feasible. Without that change, the FST compilation might easily use excessive RAM during indexing when merging large segments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1830166075
> This is reasonable as the terms index (FST) holds all the terms.
+1, nice!
> #### Fuzzy/Wildcard/Prefix queries got _much slower_
> This is also expected because currently I used the default implementation provided by `TermsEnum` which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.
OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of `seekCeil`) `.intersect` `TermsEnum` is. Once you have an optimized version it should likely be faster than block tree since it can intersect all suffixes instead of scanning `byte[]` suffixes in the term block and re-testing each.
> #### `HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput
> I don't quite understand why term lookup could affect sorting on a DV field
This is odd. Though, the `HighTermMonthSort` QPS is so crazy high as to not really be trustworthy -- likely BMW is kicking in and saving tons of work.
> #### `AndHighLow` got slower
>
> Am i missing some optimization opportunity for low freq terms?
Hmm maybe pulsing? Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1373799812
##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
exports org.apache.lucene.codecs.lucene95;
exports org.apache.lucene.codecs.lucene90.blocktree;
exports org.apache.lucene.codecs.lucene90.compressing;
+ exports org.apache.lucene.codecs.lucene90.radomaccess;
Review Comment:
good idea..
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381580022
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.
Review Comment:
`term as a` -> `term has a`?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+ private final TermStateCodecComponent[] components;
+ private final int metadataBytesLength;
+
+ private static int getMetadataLength(TermStateCodecComponent component) {
+ // 1 byte for bitWidth; optionally 8 byte more for the reference value
+ return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+ }
+
+ public TermStateCodecImpl(TermStateCodecComponent[] components) {
+ assert components.length > 0;
+
+ this.components = components;
+ int metadataBytesLength = 0;
+ for (var component : components) {
+ metadataBytesLength += getMetadataLength(component);
+ }
+ this.metadataBytesLength = metadataBytesLength;
+ }
+
+ @Override
+ public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+ Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+ byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+ // Encode inputs via the bitpacker
+ for (var termState : inputs) {
+ encodeOne(bitPacker, termState, metadataPerComponent);
+ }
+
+ return metadataBytes;
+ }
+
+ private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+ Metadata[] metadataPerComponent = new Metadata[components.length];
+ for (int i = 0; i < components.length; i++) {
+ var component = components[i];
+ byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+ long referenceValue =
+ component.isMonotonicallyIncreasing() ? component.getTargetValue(inputs[0]) : 0L;
+ metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+ }
+ return metadataPerComponent;
+ }
+
+ private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+ byte[] metadataBytes = new byte[this.metadataBytesLength];
+ ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+ for (int i = 0; i < components.length; i++) {
+ var metadata = metadataPerComponent[i];
+ dataOut.writeByte(metadata.bitWidth);
+ if (components[i].isMonotonicallyIncreasing()) {
+ dataOut.writeLong(metadata.referenceValue);
+ }
+ }
+ return metadataBytes;
+ }
+
+ private void encodeOne(
Review Comment:
OK it looks like it is not column stride -- each term is packing all of its metadata into one `byte[]` and then these `byte[]` are concatenated for all the terms.
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/Lucene90RandomAccessDictionaryPostingsFormat.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.FieldsConsumer;
+import org.apache.lucene.codecs.FieldsProducer;
+import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import org.apache.lucene.codecs.PostingsWriterBase;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsReader;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Similar to {@link Lucene90PostingsFormat} but with a different term dictionary implementation.
Review Comment:
It would be nice to spell out some high level details about how the format works :) E.g. single FST holding full terms, with `long` pointer (packed term type + up to 62 bit ordinal) to where full term metadata is packed on-disk. And the 8 different term type "groupings".
BTW I think FST will pack better if you put the 3 "term type" bits at the end (lsb) of the `long` instead of msb, because then prefixes can be shared across all term types, instead of just within each term type. But, this may not be a win because it may take more vLong bytes to write each delta, not sure :)
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are written.
+ *
+ * <p>It captures -- 1) if a term has a singleton docid (i.e. only one doc contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.
+ */
+final class TermType {
+ private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+ private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+ private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+ public static final int NUM_TOTAL_TYPES = 8;
Review Comment:
8 because this is all the combinations/permutations of Lucene's different indexing options?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecComponent.java:
##########
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+abstract class TermStateCodecComponent {
+
+ static byte getBitWidth(IntBlockTermState[] termStates, TermStateCodecComponent component) {
+ assert termStates.length > 0;
+
+ long maxValSeen = -1;
+ long referenceValue =
+ component.isMonotonicallyIncreasing() ? component.getTargetValue(termStates[0]) : 0;
+
+ for (var termState : termStates) {
+ maxValSeen = Math.max(maxValSeen, component.getTargetValue(termState) - referenceValue);
+ }
+ return (byte) (64 - Long.numberOfLeadingZeros(maxValSeen));
+ }
+
+ abstract boolean isMonotonicallyIncreasing();
+
+ abstract long getTargetValue(IntBlockTermState termState);
+
+ abstract void setTargetValue(IntBlockTermState termState, long value);
+
+ /** Below are the relevant IntBlockTermState components * */
+ static final class SingletonDocId extends TermStateCodecComponent {
+ public static SingletonDocId INSTANCE = new SingletonDocId();
+
+ private SingletonDocId() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return false;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.singletonDocID;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ assert value <= Integer.MAX_VALUE;
+ // A correct codec implementation does not change the value,
+ // after the encode/decode round-trip it should still be a valid int
+ termState.singletonDocID = (int) value;
+ }
+ }
+
+ static final class DocFreq extends TermStateCodecComponent {
+ public static DocFreq INSTANCE = new DocFreq();
+
+ private DocFreq() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return false;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.docFreq;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ assert value <= Integer.MAX_VALUE;
+ // A correct codec implementation does not change the value,
+ // after the encode/decode round-trip it should still be a valid int
+ termState.docFreq = (int) value;
+ }
+ }
+
+ static final class TotalTermFreq extends TermStateCodecComponent {
+ public static TotalTermFreq INSTANCE = new TotalTermFreq();
+
+ private TotalTermFreq() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return false;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.totalTermFreq;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ termState.totalTermFreq = value;
+ }
+ }
+
+ static final class DocStartFP extends TermStateCodecComponent {
+ public static DocStartFP INSTANCE = new DocStartFP();
+
+ private DocStartFP() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return true;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.docStartFP;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ termState.docStartFP = value;
+ }
+ }
+
+ static final class PositionStartFP extends TermStateCodecComponent {
+ public static PositionStartFP INSTANCE = new PositionStartFP();
+
+ private PositionStartFP() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return true;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.posStartFP;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ termState.posStartFP = value;
+ }
+ }
+
+ static final class PayloadStartFP extends TermStateCodecComponent {
+ public static PayloadStartFP INSTANCE = new PayloadStartFP();
+
+ private PayloadStartFP() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return true;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.payStartFP;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ termState.payStartFP = value;
+ }
+ }
+
+ static final class SkipOffset extends TermStateCodecComponent {
+ public static SkipOffset INSTANCE = new SkipOffset();
+
+ private SkipOffset() {}
+
+ @Override
+ public boolean isMonotonicallyIncreasing() {
+ return false;
+ }
+
+ @Override
+ public long getTargetValue(IntBlockTermState termState) {
+ return termState.skipOffset;
+ }
+
+ @Override
+ public void setTargetValue(IntBlockTermState termState, long value) {
+ termState.skipOffset = value;
+ }
+ }
+
+ static final class LastPositionBlockOffset extends TermStateCodecComponent {
Review Comment:
Are these per-term metadata stored column-stride in block of terms? I.e. all `LastPositionBlockOffset` for N terms in one block are encoded as a contiguous `byte[]`? (And same for `skipOffset`, `totalTermFreq`, etc.)?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
Review Comment:
Because we peel off the three high bits of the `long` to encode the term type?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if (ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if (ord > MAX_ORD) {
+ throw new IllegalArgumentException(
+ "Input ord is too large for TermType: "
Review Comment:
More user friendly message here too?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391575474
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
Review Comment:
Probably not -- I started off with an interface (perhaps because i was influenced by developing with Rust) since it has an emphasis on what exactly are needed. Also it makes mocks easier. (Though I didn't need it)
Let's revisit this later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438004582
##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static final long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler;
+
+ TermsIndexBuilder() throws IOException {
+ fstCompiler =
Review Comment:
Just FYI, you can plug the `indexOutput` directly to the FSTCompiler (passing from RandomAccessTermsDictWriter), and make the FST writing entirely off-heap. I have some example PRs at:
- https://github.com/apache/lucene/pull/12985
- https://github.com/apache/lucene/pull/12980
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1860625197
The `PKLookup` gains are astounding!
Especially interesting is the off -> on heap gains for that task. We are somehow paying a high price for going through Lucene's IO APIs instead of `byte[]` backed `ByteBuffer`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857380213
## Non-trivial amount of allocations for? .... building IndexInput slice descriptions !?
`jdk.internal.misc.Unsafe#allocateUninitializedArray()`. This was not trivial to find out why. But again with the raw JFR report, we can analyze the call tree. It turn out that in the [`buildSlice()` implementation](https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L451) of `MemorySegmentIndexInput` we call [`IndexInput#getFullSliceDescription()`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/IndexInput.java#L128) which creates new String. And `allocateUninitializedArray` is called to allocate the bytes for the String.
AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it'd cause that much of allocation. So I made a change to pass null when building the description so those allocations are gone.
```
Before
PERCENT HEAP SAMPLES STACK
10.39% 12103M java.lang.Long#valueOf()
9.91% 11543M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
8.91% 10383M jdk.internal.misc.Unsafe#allocateUninitializedArray()
After
PERCENT HEAP SAMPLES STACK [37/1812]
25.97% 32791M java.lang.Long#valueOf()
7.58% 9571M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13% 6482M org.apache.lucene.util.FixedBitSet#<init>()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1862184623
Thanks @mikemccand for taking a look! I see the [getCommonSuffixBytesRef](https://github.com/apache/lucene/blob/5d6086e1994d766a3dd39a47b14a8cd80a7280e6/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L1161) method from `Automaton`.
I wonder if it is really applicable to the FST... i.e. for the FST is it guaranteed that there exists one and only one state where all valid outputs path that share the same suffix go through? Or put it in other words, how many sub-graphs of the FST are there that represents the same suffix?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382429884
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########
Review Comment:
> Since they are of the same size...
That's the difference. In your use case the records (blocks) are guaranteed to be the same size where as in the serialized tree case the records (tree nodes) are not guaranteed to be the same size. This is by design to ensure the resulting docvalue disk consumption is as efficient (small) as possible.
> ...by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways?
Yes to variable length encoding. The "random-ness" isn't purely random in that traversal of the serialized tree is DFS. Because the tree nodes are variable size the serialized array includes copious "book-keeping" in the form of "sizeOf" values. At DFS traversal the first "sizeOf" value provides the size of the entire left tree. To prune the left tree just means we skip that many bytes to get to the right tree.. this continues recursively. In practice we don't expect to ever "back up" in our DFS traversal so there is only a `rewind` method that simply resets the offset values to 0.
Seems the two use cases are subtly different but I could see roughly 80% overlap in the implementation. I'd love to efficiently encapsulate this logic for the next contributor that wants a random serialized traversal mechanism without a ridiculous amount of java object overhead. Sounds like @bruno-roustant had the same need? Could be a good follow on progress PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
Review Comment:
Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUPUTT` value for the FST. so we peel off 4 bits in total.
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
Review Comment:
Yes, and we need to use set the 0 bit to 1 to indicate it is not `NO_OUTPUT` value for the FST. so we peel off 4 bits in total.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1796863568
Just realized that we have lucene99 Codec out! I'll update the code to reflect that as this posting format aims to work with the latest Codec.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925
After some tweaking and tinkering I was finally able to get the bench running the way I wanted in luceneutil! @mikemccand
Unfortunately luceneutil out of the box doesn't work for my case...
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value [19/1802]
Wildcard 56.20 (2.0%) 7.30 (0.2%) -87.0% ( -87% - -86%) 0.000
Respell 44.00 (1.7%) 14.46 (0.6%) -67.1% ( -68% - -65%) 0.000
Fuzzy1 54.11 (1.2%) 20.86 (0.8%) -61.4% ( -62% - -60%) 0.000
Prefix3 103.61 (0.8%) 41.37 (0.5%) -60.1% ( -60% - -59%) 0.000
Fuzzy2 42.43 (1.2%) 20.26 (0.9%) -52.3% ( -53% - -50%) 0.000
HighTermTitleSort 127.60 (1.7%) 114.65 (1.6%) -10.1% ( -13% - -6%) 0.000
HighTermMonthSort 2549.22 (2.8%) 2435.74 (3.8%) -4.5% ( -10% - 2%) 0.037
AndHighLow 708.99 (2.3%) 678.15 (2.1%) -4.4% ( -8% - 0%) 0.001
LowTerm 369.16 (5.5%) 358.15 (3.1%) -3.0% ( -10% - 5%) 0.287
OrNotHighLow 258.51 (1.8%) 252.11 (2.2%) -2.5% ( -6% - 1%) 0.054
OrHighLow 348.58 (1.1%) 340.11 (2.5%) -2.4% ( -5% - 1%) 0.046
OrNotHighHigh 141.31 (6.7%) 138.54 (3.2%) -2.0% ( -11% - 8%) 0.554
IntNRQ 17.93 (1.8%) 17.62 (3.7%) -1.7% ( -7% - 3%) 0.349
MedSloppyPhrase 26.73 (0.8%) 26.28 (1.3%) -1.7% ( -3% - 0%) 0.017
HighIntervalsOrdered 3.88 (2.6%) 3.82 (2.3%) -1.6% ( -6% - 3%) 0.314
HighTerm 290.77 (7.3%) 286.20 (6.9%) -1.6% ( -14% - 13%) 0.727
OrHighNotHigh 296.51 (6.9%) 291.90 (4.9%) -1.6% ( -12% - 11%) 0.682
LowIntervalsOrdered 15.48 (1.1%) 15.26 (1.3%) -1.4% ( -3% - 0%) 0.057
MedTerm 405.54 (6.3%) 399.77 (5.9%) -1.4% ( -12% - 11%) 0.713
LowSloppyPhrase 44.64 (0.5%) 44.04 (2.1%) -1.3% ( -3% - 1%) 0.171
HighTermDayOfYearSort 177.36 (1.6%) 175.08 (1.8%) -1.3% ( -4% - 2%) 0.234
OrHighMed 79.44 (3.7%) 78.48 (3.0%) -1.2% ( -7% - 5%) 0.572
OrNotHighMed 268.82 (4.3%) 265.74 (3.1%) -1.1% ( -8% - 6%) 0.632
BrowseMonthTaxoFacets 3.89 (0.4%) 3.85 (1.2%) -0.9% ( -2% - 0%) 0.134
AndHighMedDayTaxoFacets 44.33 (0.7%) 43.96 (1.1%) -0.8% ( -2% - 0%) 0.145
MedSpanNear 30.67 (1.1%) 30.42 (2.3%) -0.8% ( -4% - 2%) 0.481
LowSpanNear 4.56 (0.8%) 4.53 (2.1%) -0.6% ( -3% - 2%) 0.576
HighSpanNear 8.52 (1.4%) 8.47 (2.1%) -0.5% ( -3% - 2%) 0.636
OrHighNotLow 236.32 (6.6%) 235.28 (4.4%) -0.4% ( -10% - 11%) 0.901
AndHighHighDayTaxoFacets 4.31 (0.6%) 4.29 (0.8%) -0.4% ( -1% - 1%) 0.446
LowPhrase 69.90 (1.2%) 69.67 (2.7%) -0.3% ( -4% - 3%) 0.807
AndHighMed 41.87 (0.6%) 41.75 (2.7%) -0.3% ( -3% - 3%) 0.828
OrHighHigh 45.86 (7.4%) 45.77 (7.9%) -0.2% ( -14% - 16%) 0.969
TermDTSort 101.90 (2.3%) 101.84 (2.0%) -0.1% ( -4% - 4%) 0.966
HighSloppyPhrase 0.48 (2.2%) 0.48 (2.6%) -0.0% ( -4% - 4%) 0.995
BrowseDateSSDVFacets 1.00 (3.7%) 1.00 (5.3%) -0.0% ( -8% - 9%) 1.000
HighTermTitleBDVSort 4.48 (5.4%) 4.49 (4.4%) 0.0% ( -9% - 10%) 0.990
BrowseDayOfYearSSDVFacets 3.51 (13.4%) 3.52 (13.3%) 0.1% ( -23% - 30%) 0.990
BrowseMonthSSDVFacets 3.38 (0.7%) 3.39 (0.6%) 0.1% ( -1% - 1%) 0.740
BrowseDayOfYearTaxoFacets 3.85 (0.3%) 3.85 (0.5%) 0.2% ( 0% - 0%) 0.469
MedIntervalsOrdered 1.65 (1.5%) 1.65 (1.5%) 0.2% ( -2% - 3%) 0.802
BrowseDateTaxoFacets 3.82 (0.3%) 3.83 (0.4%) 0.3% ( 0% - 0%) 0.264
BrowseRandomLabelSSDVFacets 2.34 (1.0%) 2.35 (0.5%) 0.3% ( -1% - 1%) 0.494
OrHighNotMed 296.34 (7.6%) 297.38 (4.6%) 0.4% ( -11% - 13%) 0.930
BrowseRandomLabelTaxoFacets 3.29 (0.6%) 3.31 (0.7%) 0.5% ( 0% - 1%) 0.275
MedPhrase 9.84 (2.1%) 9.90 (4.1%) 0.6% ( -5% - 6%) 0.787
AndHighHigh 33.50 (1.2%) 33.73 (3.4%) 0.7% ( -3% - 5%) 0.669
HighPhrase 15.15 (2.3%) 15.28 (4.1%) 0.9% ( -5% - 7%) 0.679
MedTermDayTaxoFacets 13.36 (1.8%) 13.55 (2.0%) 1.4% ( -2% - 5%) 0.240
OrHighMedDayTaxoFacets 3.73 (1.4%) 3.83 (2.2%) 2.6% ( 0% - 6%) 0.026
PKLookup 147.84 (1.6%) 157.37 (1.5%) 6.4% ( 3% - 9%) 0.000
```
### Observations:
#### PKLookup has improvement
This is reasonable as the terms index (FST) holds all the terms.
#### Fuzzy/Wildcard/Prefix queries got *much slower*
This is also expected because currently I used the default implementation provided by `TermsEnum` which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.
#### `HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput
I don't quite understand why term lookup could
#### `AndHighLow` got slower
Am i missing some optimization opportunity for low freq terms?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381994228
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if (ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if (ord > MAX_ORD) {
+ throw new IllegalArgumentException(
+ "Input ord is too large for TermType: "
Review Comment:
Did you mean to spell out the input ord?
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/Lucene90RandomAccessDictionaryPostingsFormat.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.FieldsConsumer;
+import org.apache.lucene.codecs.FieldsProducer;
+import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import org.apache.lucene.codecs.PostingsWriterBase;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsReader;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Similar to {@link Lucene90PostingsFormat} but with a different term dictionary implementation.
Review Comment:
> BTW I think FST will pack better if you put the 3 "term type" bits at the end (lsb) of the long instead of msb, because then prefixes can be shared across all term types, instead of just within each term type. But, this may not be a win because it may take more vLong bytes to write each delta, not sure :)
I believe this is what it is currently doing :) it will not create full 64bit long this way. To be specific --
* bit 0: indicate it is not `NO_OUTPUT` always set to 1 for valid (termType, ord)
* bit [1, 4): termType
* bit [4, 64): ord value
bit index is from low to high, bit 0 is least significant bit.
https://github.com/Tony-X/lucene/blob/ramdon_access_term_dict/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java#L68
[code](https://github.com/Tony-X/lucene/blob/ramdon_access_term_dict/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java#L68)
```java
return (ord << 4) | ((long) termType.getId() << 1) | 1L;
```
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########
Review Comment:
Hey Knize, thanks for reviewing.
This is not test-only. It is used by the `TestTermStateCodecImpl`. I'm in the process of building the real compact bit packer.
> I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?
I did search through the code base and couldn't find something I can use. The goal here is to pack a sequence of values that have different bitwidths . We can't use `PackedInts` as it requires values to have same bitwidth. We can't use VLong either since we aim to write fixed size record, so that we can do random access.
More detailed discussion can be found in this email thread: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:packedInts
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "bruno-roustant (via GitHub)" <gi...@apache.org>.
bruno-roustant commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1772589968
I'll also try to review!
On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1890169118
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1792903331
> It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=<name> and Lucene will run all tests cases using your Codec.
Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382113535
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########
Review Comment:
> +1. I'm assuming that will be added to this PR?
Yes! This PR can be large so I took the advice from @mikemccand to open it early to avoid massive diff in one shot. Goal of the PR is to have a fully functional PostingFormat (or Codec).
> The reason I ask is because I did a similar bit packing w/ "random access" when serializing the [ShapeDocValues binary tree](https://github.com/apache/lucene/blob/d6836d3d0e5d33a98b35c0885b9787f46c4be47e/lucene/core/src/java/org/apache/lucene/document/ShapeDocValues.java#L462C26-L462C26) and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more useable and readable?
Not sure if this code does the same thing. I could be wrong, but by a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random-access is achieved in different ways?
Here in this PR, the use case is -- I have a bunch of bits in the form of byte[] that represents a block records that have same size (measure in bits, but size can be > 64 so we can't use PackedInts). Since they are of the same size, we can randomly access any record with an index and read the bits at [index * size, (index+1) * size]
I do agree that we should seek opportunities to unify. But for now since this is under sandbox, I'll make it specific to this implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1380840508
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##########
Review Comment:
Looks like this is only used by tests? Maybe move to the tests package? I'm also curious every time I see a new bit packer as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1808041346
> Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?
You don't really need a new `Codec` -- you could test this easily by subclassing `Lucene99Codec` and overriding `getPostingsFormatForField` to use your new cool bit packing `PostingsFormat`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "bruno-roustant (via GitHub)" <gi...@apache.org>.
bruno-roustant commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773923204
This is some code I wrote a long time ago. It has been tested and used, so
I'm confident on the functional aspect, and it might benefit from a
benchmark for perf.
Le ven. 20 oct. 2023 à 19:20, Tony-X ***@***.***> a écrit :
> Thanks @bruno-roustant <https://github.com/bruno-roustant> ! If you're
> okay to share it feel free to share it here.
>
> I'm in the process of baking my own specific implementation (which
> internally uses a single long as bit buffer), but I might absorb some
> interesting ideas from your impl.
>
> —
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/lucene/pull/12688#issuecomment-1773113866>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AIC45DG57A67JAS2CWX4SDLYAKXF5AVCNFSM6AAAAAA6C3VWVGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZTGEYTGOBWGY>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1365702707
##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
exports org.apache.lucene.codecs.lucene95;
exports org.apache.lucene.codecs.lucene90.blocktree;
exports org.apache.lucene.codecs.lucene90.compressing;
+ exports org.apache.lucene.codecs.lucene90.radomaccess;
Review Comment:
Let's put this (experimental) Codec in `sandbox` to begin with?
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermType.java:
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+import java.util.Objects;
+
+class TermType {
+ private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+ private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+ private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+ public static final int NUM_TOTAL_TYPES = 8;
Review Comment:
Maybe add a comment explaining what a "type" is in this context? Is it all permutations of those three flags?
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the ordinals
+ * are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if ( ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if ( ord > ((1L << 60) - 1) ) {
+ throw new IllegalArgumentException("Input ord is too large");
Review Comment:
Maybe a more user-friendly error, like "too many terms of type X" or so? And maybe include the limit (`1L << 60`) in the exception message?
##########
lucene/core/src/java/module-info.java:
##########
@@ -35,6 +35,7 @@
exports org.apache.lucene.codecs.lucene95;
exports org.apache.lucene.codecs.lucene90.blocktree;
exports org.apache.lucene.codecs.lucene90.compressing;
+ exports org.apache.lucene.codecs.lucene90.radomaccess;
Review Comment:
Also, `radomaccess` is missing an `n` --> `randomaccess`.
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndex.java:
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.fst.FST;
+
+
+record TermsIndex(FST<Long> fst) {
Review Comment:
Ooooh a `record`!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1373800366
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndex.java:
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.fst.FST;
+
+
+record TermsIndex(FST<Long> fst) {
Review Comment:
yup! you get equals() and hashCode() for free!
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/radomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene90.radomaccess;
+
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the ordinals
+ * are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler =
+ new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+ TermsIndexBuilder() {
+ Arrays.fill(countPerType, -1);
+ }
+
+ public void addTerm(BytesRef term, TermType termType) throws IOException {
+ IntsRefBuilder scratchInts = new IntsRefBuilder();
+ long ord = ++countPerType[termType.getId()];
+ fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+ }
+
+ public TermsIndex build() throws IOException {
+ return new TermsIndex(fstCompiler.compile());
+ }
+
+ private long encode(long ord, TermType termType) {
+ // use a single long to encode `ord` and `termType`
+ // also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+ // so it looks like this |... ord ...| termType| ... hasOutput ...|
+ // where termType takes 3 bit and hasOutput takes the lowest bit. The rest is taken by ord
+ if ( ord < 0) {
+ throw new IllegalArgumentException("can't encode negative ord");
+ }
+ if ( ord > ((1L << 60) - 1) ) {
+ throw new IllegalArgumentException("Input ord is too large");
Review Comment:
Yeah, makes sense. your suggestion would make the error more usable
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "dungba88 (via GitHub)" <gi...@apache.org>.
dungba88 commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1438014984
##########
lucene/codecs/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermsIndexBuilder.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> (type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+ private static final long MAX_ORD = (1L << 60) - 1;
+
+ private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+ private final FSTCompiler<Long> fstCompiler;
+
+ TermsIndexBuilder() throws IOException {
+ fstCompiler =
Review Comment:
Note that the FST metadata and data cannot be written to the same file, but it seems you already separated `metaOutput` and `indexOutput` so that should be fine.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391566961
##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {
Review Comment:
I know.... I started off calling it `TermStateEncoder` but I soon realized it needs decoding, too.
I'm open to change its names :)
One good thing is that this is totally implementation detail which needs not to be exposed outside this package.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] Random access term dictionary [lucene]
Posted by "Tony-X (via GitHub)" <gi...@apache.org>.
Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857371557
Since the first working version, I iterated with a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil !).
Some of them comes from my code that implements the term dictionary data lookup, and a few of them are at more general Lucene level. I want to highlight the general issue I see from this work and maybe we can have separate issues to improve them!
Here is the first heap profile comparison (search-only, no indexing).
```
Candidate Heap
17.50% 24440M java.lang.Long#valueOf()
10.09% 14096M jdk.internal.misc.Unsafe#allocateUninitializedArray()
6.87% 9594M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
4.40% 6140M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
...
```
```
main
13.65% 11898M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
9.26% 8071M org.apache.lucene.util.FixedBitSet#<init>()
6.70% 5836M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
6.60% 5751M org.apache.lucene.util.ArrayUtil#growExact()
5.21% 4541M org.apache.lucene.facet.FacetsConfig#stringToPath()
4.69% 4090M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
```
## FST doesn't play nicely with primitive types (I know, this is more or less a java issue)
`24440M java.lang.Long#valueOf()` huge amount of allocations... This is obvious. The FST<T> implementation is generic over its output type and in my case T is `Long`. So for trivial `long` add and subtract, the implementation would allocate an object. Not only it is wasteful but from a perf perspective it'd be less than 1 CPU cycle v.s. calling allocator which is easily tens if not hundreds of cycles.
For this work, I forked the FST<T> class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results before and after.
```
Before
PERCENT HEAP SAMPLES STACK
25.97% 32791M java.lang.Long#valueOf()
7.58% 9571M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13% 6482M org.apache.lucene.util.FixedBitSet#<init>()
4.90%
....
After
PERCENT HEAP SAMPLES STACK
8.44% 7988M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
7.17% 6788M org.apache.lucene.util.FixedBitSet#<init>()
6.22% 5886M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.89% 5577M org.apache.lucene.util.ArrayUtil#growExact()
```
Bench diff
```
Before
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 11.61 (2.7%) 2.40 (0.6%) -79.4% ( -80% - -78%) 0.000
Fuzzy1 78.17 (0.7%) 27.16 (0.9%) -65.3% ( -66% - -64%) 0.000
Respell 29.09 (0.6%) 10.91 (0.8%) -62.5% ( -63% - -61%) 0.000
Fuzzy2 47.80 (0.6%) 18.50 (1.2%) -61.3% ( -62% - -59%) 0.000
Prefix3 765.08 (3.1%) 463.94 (0.9%) -39.4% ( -42% - -36%) 0.000
HighTermTitleSort 98.48 (2.0%) 90.62 (2.2%) -8.0% ( -11% - -3%) 0.000
BrowseMonthTaxoFacets 3.89 (29.2%) 3.62 (0.9%) -6.9% ( -28% - 32%) 0.293
LowSloppyPhrase 22.73 (6.5%) 22.35 (6.9%) -1.7% ( -14% - 12%) 0.432
LowTerm 365.47 (3.6%) 359.62 (2.9%) -1.6% ( -7% - 5%) 0.121
HighTerm 398.57 (5.1%) 393.16 (4.7%) -1.4% ( -10% - 8%) 0.380
MedSloppyPhrase 10.63 (3.6%) 10.51 (3.7%) -1.1% ( -8% - 6%) 0.339
MedTerm 422.73 (4.2%) 418.60 (4.0%) -1.0% ( -8% - 7%) 0.451
MedTermDayTaxoFacets 14.84 (2.6%) 14.71 (2.5%) -0.8% ( -5% - 4%) 0.296
HighSloppyPhrase 12.41 (3.1%) 12.33 (3.1%) -0.7% ( -6% - 5%) 0.487
HighTermTitleBDVSort 6.88 (3.3%) 6.84 (3.5%) -0.6% ( -7% - 6%) 0.599
LowPhrase 58.15 (2.9%) 57.85 (2.8%) -0.5% ( -6% - 5%) 0.567
BrowseDayOfYearSSDVFacets 3.24 (0.4%) 3.23 (0.5%) -0.3% ( -1% - 0%) 0.042
MedPhrase 26.19 (3.1%) 26.11 (3.2%) -0.3% ( -6% - 6%) 0.775
OrNotHighMed 185.23 (3.9%) 184.73 (3.3%) -0.3% ( -7% - 7%) 0.813
OrHighMedDayTaxoFacets 3.82 (3.3%) 3.81 (3.2%) -0.3% ( -6% - 6%) 0.796
OrHighNotLow 194.98 (5.1%) 194.51 (4.6%) -0.2% ( -9% - 10%) 0.875
OrHighNotMed 337.15 (4.4%) 336.53 (3.8%) -0.2% ( -7% - 8%) 0.888
IntNRQ 67.60 (0.9%) 67.55 (1.0%) -0.1% ( -1% - 1%) 0.783
MedSpanNear 9.85 (1.4%) 9.84 (2.1%) -0.1% ( -3% - 3%) 0.906
OrNotHighHigh 205.12 (4.1%) 205.01 (3.9%) -0.1% ( -7% - 8%) 0.967
AndHighHighDayTaxoFacets 6.35 (1.5%) 6.34 (1.7%) -0.0% ( -3% - 3%) 0.932
BrowseMonthSSDVFacets 3.29 (0.8%) 3.29 (0.7%) -0.0% ( -1% - 1%) 0.887
BrowseRandomLabelSSDVFacets 2.30 (0.8%) 2.30 (1.0%) 0.0% ( -1% - 1%) 0.919
LowSpanNear 16.41 (2.6%) 16.42 (2.7%) 0.1% ( -5% - 5%) 0.931
HighPhrase 77.12 (3.0%) 77.20 (3.6%) 0.1% ( -6% - 6%) 0.923
AndHighMedDayTaxoFacets 39.64 (1.2%) 39.68 (1.0%) 0.1% ( -2% - 2%) 0.742
BrowseRandomLabelTaxoFacets 3.19 (1.6%) 3.19 (1.1%) 0.1% ( -2% - 2%) 0.728
BrowseDateTaxoFacets 3.73 (0.7%) 3.74 (0.5%) 0.3% ( 0% - 1%) 0.157
AndHighHigh 27.08 (1.3%) 27.15 (3.0%) 0.3% ( -4% - 4%) 0.718
BrowseDayOfYearTaxoFacets 3.76 (0.6%) 3.77 (0.5%) 0.3% ( 0% - 1%) 0.072
HighTermDayOfYearSort 224.01 (2.1%) 224.81 (2.1%) 0.4% ( -3% - 4%) 0.592
HighSpanNear 6.09 (2.7%) 6.11 (3.1%) 0.4% ( -5% - 6%) 0.683
HighIntervalsOrdered 8.08 (3.3%) 8.11 (3.4%) 0.4% ( -6% - 7%) 0.705
TermDTSort 103.29 (4.4%) 103.83 (3.1%) 0.5% ( -6% - 8%) 0.669
MedIntervalsOrdered 33.12 (4.4%) 33.29 (4.6%) 0.5% ( -8% - 9%) 0.702
LowIntervalsOrdered 10.06 (3.9%) 10.12 (3.6%) 0.6% ( -6% - 8%) 0.609
AndHighMed 73.71 (2.2%) 74.18 (2.5%) 0.6% ( -3% - 5%) 0.394
OrHighMed 71.44 (2.7%) 71.98 (3.3%) 0.7% ( -5% - 6%) 0.429
BrowseDateSSDVFacets 0.96 (4.8%) 0.97 (5.7%) 0.9% ( -9% - 11%) 0.601
OrHighNotHigh 308.82 (4.0%) 311.53 (3.7%) 0.9% ( -6% - 8%) 0.470
OrHighLow 404.69 (3.0%) 408.63 (3.5%) 1.0% ( -5% - 7%) 0.348
OrHighHigh 20.44 (4.7%) 20.73 (7.2%) 1.4% ( -10% - 13%) 0.469
OrNotHighLow 381.28 (1.8%) 388.18 (2.1%) 1.8% ( -2% - 5%) 0.004
HighTermMonthSort 2500.04 (2.2%) 2554.91 (4.3%) 2.2% ( -4% - 8%) 0.042
AndHighLow 668.12 (3.1%) 692.04 (3.9%) 3.6% ( -3% - 10%) 0.001
PKLookup 140.25 (2.0%) 168.53 (1.9%) 20.2% ( 15% - 24%) 0.000
After
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 54.96 (2.6%) 10.43 (0.5%) -81.0% ( -82% - -80%) 0.000
Respell 45.54 (1.0%) 16.74 (0.7%) -63.2% ( -64% - -62%) 0.000
Fuzzy1 46.41 (1.2%) 17.26 (1.0%) -62.8% ( -64% - -61%) 0.000
Prefix3 121.65 (2.4%) 55.57 (0.9%) -54.3% ( -56% - -52%) 0.000
Fuzzy2 32.33 (1.2%) 15.79 (1.1%) -51.2% ( -52% - -49%) 0.000
HighTermTitleSort 95.24 (2.1%) 87.04 (1.9%) -8.6% ( -12% - -4%) 0.000
BrowseRandomLabelSSDVFacets 2.37 (7.1%) 2.33 (4.8%) -1.7% ( -12% - 10%) 0.374
BrowseMonthSSDVFacets 3.34 (7.3%) 3.29 (0.6%) -1.5% ( -8% - 6%) 0.362
TermDTSort 120.57 (2.3%) 119.02 (3.4%) -1.3% ( -6% - 4%) 0.163
OrHighHigh 19.13 (5.6%) 18.92 (2.7%) -1.1% ( -8% - 7%) 0.430
AndHighHigh 22.04 (5.1%) 21.87 (3.0%) -0.8% ( -8% - 7%) 0.555
AndHighMed 55.06 (3.0%) 54.79 (2.1%) -0.5% ( -5% - 4%) 0.546
HighSpanNear 3.29 (1.6%) 3.28 (1.7%) -0.5% ( -3% - 2%) 0.346
HighIntervalsOrdered 0.65 (1.8%) 0.65 (2.0%) -0.5% ( -4% - 3%) 0.433
HighTermDayOfYearSort 282.86 (2.0%) 281.57 (2.6%) -0.5% ( -4% - 4%) 0.533
MedIntervalsOrdered 16.36 (1.5%) 16.29 (1.5%) -0.4% ( -3% - 2%) 0.369
OrHighMed 68.27 (2.9%) 67.99 (1.8%) -0.4% ( -5% - 4%) 0.598
MedSpanNear 3.22 (1.0%) 3.21 (1.4%) -0.4% ( -2% - 2%) 0.317
HighSloppyPhrase 9.59 (2.5%) 9.57 (2.6%) -0.3% ( -5% - 4%) 0.733
BrowseMonthTaxoFacets 3.64 (2.4%) 3.63 (1.8%) -0.2% ( -4% - 4%) 0.756
LowIntervalsOrdered 14.66 (0.9%) 14.63 (1.5%) -0.2% ( -2% - 2%) 0.633
MedTermDayTaxoFacets 15.56 (2.7%) 15.54 (3.9%) -0.2% ( -6% - 6%) 0.879
AndHighMedDayTaxoFacets 18.70 (1.4%) 18.67 (3.7%) -0.2% ( -5% - 5%) 0.864
LowSpanNear 4.39 (1.1%) 4.38 (1.4%) -0.1% ( -2% - 2%) 0.728
OrHighMedDayTaxoFacets 5.38 (3.5%) 5.38 (5.4%) -0.1% ( -8% - 9%) 0.945
AndHighHighDayTaxoFacets 7.06 (1.6%) 7.06 (3.0%) -0.1% ( -4% - 4%) 0.924
LowSloppyPhrase 7.16 (1.4%) 7.15 (1.6%) -0.1% ( -2% - 2%) 0.891
MedSloppyPhrase 128.54 (1.9%) 128.56 (2.2%) 0.0% ( -4% - 4%) 0.979
LowTerm 417.80 (3.3%) 418.01 (2.7%) 0.1% ( -5% - 6%) 0.958
LowPhrase 125.59 (4.0%) 125.77 (3.1%) 0.1% ( -6% - 7%) 0.900
OrHighLow 313.22 (2.1%) 313.72 (2.2%) 0.2% ( -4% - 4%) 0.817
BrowseDateTaxoFacets 3.73 (0.7%) 3.74 (0.7%) 0.2% ( -1% - 1%) 0.470
BrowseDayOfYearTaxoFacets 3.76 (0.7%) 3.76 (0.7%) 0.2% ( -1% - 1%) 0.457
MedTerm 384.57 (4.6%) 385.44 (3.6%) 0.2% ( -7% - 8%) 0.863
OrHighNotHigh 255.07 (4.3%) 256.05 (4.3%) 0.4% ( -7% - 9%) 0.778
MedPhrase 11.17 (3.0%) 11.21 (2.6%) 0.4% ( -5% - 6%) 0.658
HighTerm 361.26 (5.1%) 362.86 (4.2%) 0.4% ( -8% - 10%) 0.764
BrowseRandomLabelTaxoFacets 3.19 (1.5%) 3.20 (0.6%) 0.5% ( -1% - 2%) 0.203
OrNotHighHigh 205.38 (4.0%) 206.35 (4.0%) 0.5% ( -7% - 8%) 0.712
OrNotHighLow 317.96 (1.7%) 319.48 (2.1%) 0.5% ( -3% - 4%) 0.428
HighPhrase 47.91 (3.8%) 48.15 (3.3%) 0.5% ( -6% - 7%) 0.661
BrowseDateSSDVFacets 0.97 (6.9%) 0.98 (6.7%) 0.5% ( -12% - 15%) 0.801
OrHighNotLow 185.96 (4.9%) 187.04 (5.0%) 0.6% ( -8% - 11%) 0.710
BrowseDayOfYearSSDVFacets 3.21 (2.1%) 3.23 (0.9%) 0.6% ( -2% - 3%) 0.225
HighTermTitleBDVSort 5.83 (3.7%) 5.87 (4.0%) 0.7% ( -6% - 8%) 0.584
OrNotHighMed 516.84 (2.5%) 520.76 (2.5%) 0.8% ( -4% - 5%) 0.334
IntNRQ 29.24 (3.0%) 29.50 (4.1%) 0.9% ( -6% - 8%) 0.425
OrHighNotMed 268.45 (4.4%) 270.92 (4.2%) 0.9% ( -7% - 9%) 0.501
HighTermMonthSort 2498.46 (4.8%) 2590.43 (3.7%) 3.7% ( -4% - 12%) 0.007
AndHighLow 747.94 (2.1%) 775.60 (4.0%) 3.7% ( -2% - 10%) 0.000
PKLookup 141.68 (2.0%) 177.85 (1.5%) 25.5% ( 21% - 29%) 0.000
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org