You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/15 20:44:25 UTC

[GitHub] [lucene] zhaih opened a new pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

zhaih opened a new pull request #442:
URL: https://github.com/apache/lucene/pull/442


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   As mentioned in the issue, use NumericDocValues to store parent array instead of term positioning.
   
   Benchmark results available in JIRA, in short we're seeing a slightly larger taxonomy index and plain performance
   
   Added some if-branches for backward compatibility.
   
   ## Misc
   I've unified my names in "CHNAGES.txt" to be "Patrick Zhai" (Previously sometimes I used "Haoyu Zhai")
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes. (Old tests should be sufficient)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751706449



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {

Review comment:
       LOL I would feel more comfortable if that `assert` checked that the MP is actually `Log*MP`, not just 'not TMP` :)  But we don't need to fix that now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751545923



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -466,18 +476,22 @@ protected final void ensureOpen() {
    * effectively synchronized as well.
    */
   private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOException {
-    // Before Lucene 2.9, position increments >=0 were supported, so we
-    // added 1 to parent to allow the parent -1 (the parent of the root).
-    // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is
-    // no longer enough, since 0 is not encoded consistently either (see
-    // comment in SinglePositionTokenStream). But because we must be
-    // backward-compatible with existing indexes, we can't just fix what
-    // we write here (e.g., to write parent+2), and need to do a workaround
-    // in the reader (which knows that anyway only category 0 has a parent
-    // -1).
-    parentStream.set(Math.max(parent + 1, 1));
     Document d = new Document();
-    d.add(parentStreamField);
+    if (useOlderTermPositionForParentOrdinal) {
+      // Before Lucene 2.9, position increments >=0 were supported, so we
+      // added 1 to parent to allow the parent -1 (the parent of the root).
+      // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is
+      // no longer enough, since 0 is not encoded consistently either (see
+      // comment in SinglePositionTokenStream). But because we must be
+      // backward-compatible with existing indexes, we can't just fix what
+      // we write here (e.g., to write parent+2), and need to do a workaround
+      // in the reader (which knows that anyway only category 0 has a parent
+      // -1).
+      parentStream.set(Math.max(parent + 1, 1));

Review comment:
       OK!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751350351



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,40 +125,82 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      if (parentValues == null) {
+        throw new CorruptIndexException(
+            "Parent data field " + Consts.FIELD_PARENT_ORDINAL_NDV + "not exists",
+            leafContext.reader().toString());
+      }
+
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + (doc + leafContext.docBase), reader.toString());
+        }
+        // we're putting an int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = Math.toIntExact(parentValues.longValue());
+      }
+    }
+  }
+
+  /**
+   * Try loading the old way of storing parent ordinal first, return true if the parent array is
+   * loaded Or false if not, and we will try loading using NumericDocValues
+   */
+  // TODO: Remove in Lucene 10, this is only for back-compatibility
+  private boolean tryLoadParentUsingTermPosition(IndexReader reader, int first) throws IOException {
     // it's ok to use MultiTerms because we only iterate on one posting list.
     // breaking it to loop over the leaves() only complicates code for no
     // apparent gain.
     PostingsEnum positions =
         MultiTerms.getTermPostingsEnum(
             reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
 
+    if (positions == null) {
+      // try using NumericDocValues then
+      return false;
+    }
+
     // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
+    if (positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
       throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
+          "[Lucene 8]Missing parent data for category " + first, reader.toString());

Review comment:
       Could we add a space between the `[Lucene 8]` and the message?  E.g. `[Lucene 8] Missing ...`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751348311



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {

Review comment:
       Hrmph, yes -- this is the `near-real-time` case, where we are opening a  taxo reader from a taxo writer before any segment infos have been committed.  In this case, we know the index is created with the latest (current) version of Lucene, so maybe we could catch the `IndexNotFoundException` and add a comment that this is OK?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750733313



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + doc + leafContext.docBase, reader.toString());

Review comment:
       Oh good point, yeah I'm not sure about that as well...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750831307



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {

Review comment:
       Hmmm seems not all index have segments file?
   https://github.com/apache/lucene/runs/4232924356?check_suite_focus=true#step:7:194




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750733549



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);

Review comment:
       Yeah good idea!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750735067



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {

Review comment:
       Yes :)
   https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java#L152-L154




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750733364



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {

Review comment:
       OK!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r752459454



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,6 +125,46 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (getMajorVersion(reader) <= 8) {
+      loadParentUsingTermPosition(reader, first);
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      if (parentValues == null) {
+        throw new CorruptIndexException(
+            "Parent data field " + Consts.FIELD_PARENT_ORDINAL_NDV + " not exists",
+            leafContext.reader().toString());
+      }
+
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + (doc + leafContext.docBase), reader.toString());
+        }
+        // we're putting an int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = Math.toIntExact(parentValues.longValue());
+      }
+    }
+  }
+
+  private static int getMajorVersion(IndexReader reader) {
+    return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();

Review comment:
       Yeah it is guaranteed to be non-empty, but it might be a good idea adding an assertion (fortunately this is just a private method




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r752558188



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -92,15 +93,18 @@
   private final Directory dir;
   private final IndexWriter indexWriter;
   private final boolean useOlderStoredFieldIndex;
+  private final boolean useOlderTermPositionForParentOrdinal;

Review comment:
       Just merged this into my working branch. Thanks for making that merge so easy!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751355800



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {

Review comment:
       The major version used to create each segment is available. Can you leverage that? It's consistent with our approach in #443 for what it's worth.
   
   For example: `reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor() <= 8`

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -92,15 +93,18 @@
   private final Directory dir;
   private final IndexWriter indexWriter;
   private final boolean useOlderStoredFieldIndex;
+  private final boolean useOlderTermPositionForParentOrdinal;

Review comment:
       Could you have a quick look at what I did in #443 and consider copying the same idea? It just collapses these two booleans into a single one that basically says "am I working with an 8.x or earlier index?" I need the same idea in #443 so it might be nice to collapse everything down to this one boolean for back-compat. Alternatively we could have three separate ones if you think that makes more sense? Either way, let's reconcile this.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -466,18 +476,22 @@ protected final void ensureOpen() {
    * effectively synchronized as well.
    */
   private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOException {
-    // Before Lucene 2.9, position increments >=0 were supported, so we
-    // added 1 to parent to allow the parent -1 (the parent of the root).
-    // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is
-    // no longer enough, since 0 is not encoded consistently either (see
-    // comment in SinglePositionTokenStream). But because we must be
-    // backward-compatible with existing indexes, we can't just fix what
-    // we write here (e.g., to write parent+2), and need to do a workaround
-    // in the reader (which knows that anyway only category 0 has a parent
-    // -1).
-    parentStream.set(Math.max(parent + 1, 1));
     Document d = new Document();
-    d.add(parentStreamField);
+    if (useOlderTermPositionForParentOrdinal) {
+      // Before Lucene 2.9, position increments >=0 were supported, so we
+      // added 1 to parent to allow the parent -1 (the parent of the root).
+      // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is
+      // no longer enough, since 0 is not encoded consistently either (see
+      // comment in SinglePositionTokenStream). But because we must be
+      // backward-compatible with existing indexes, we can't just fix what
+      // we write here (e.g., to write parent+2), and need to do a workaround
+      // in the reader (which knows that anyway only category 0 has a parent
+      // -1).
+      parentStream.set(Math.max(parent + 1, 1));

Review comment:
       Maybe `assert parentStreamField != null`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751969762



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,6 +125,46 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (getMajorVersion(reader) <= 8) {
+      loadParentUsingTermPosition(reader, first);
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      if (parentValues == null) {
+        throw new CorruptIndexException(
+            "Parent data field " + Consts.FIELD_PARENT_ORDINAL_NDV + " not exists",
+            leafContext.reader().toString());
+      }
+
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + (doc + leafContext.docBase), reader.toString());
+        }
+        // we're putting an int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = Math.toIntExact(parentValues.longValue());
+      }
+    }
+  }
+
+  private static int getMajorVersion(IndexReader reader) {
+    return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();

Review comment:
       Is the index guaranteed to be non-empty? Otherwise getting the first leaf could throw an IndexOutOfBoundsException.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750735847



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -161,12 +165,14 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW
       indexEpoch = 1;
       // no commit exists so we can safely use the new BinaryDocValues field
       useOlderStoredFieldIndex = false;
+      useOlderTermPositionForParentOrdinal = false;
     } else {
       String epochStr = null;
 
       SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
       /* a previous commit exists, so check the version of the last commit */
       useOlderStoredFieldIndex = infos.getIndexCreatedVersionMajor() <= 8;
+      useOlderTermPositionForParentOrdinal = infos.getIndexCreatedVersionMajor() <= 8;

Review comment:
       Yes I guess we don't need to be that fancy (the segmented by segmented one) on this change LOL.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r751544194



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -92,15 +93,18 @@
   private final Directory dir;
   private final IndexWriter indexWriter;
   private final boolean useOlderStoredFieldIndex;
+  private final boolean useOlderTermPositionForParentOrdinal;

Review comment:
       Yeah I was thinking the same when making the change. I think it's not necessary to use the different variables, seems you have changed the name to `useOlderFormat`, I'll just use that name too (but I guess there'll be a merge conflict anyway, any idea on how to avoid that?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand merged pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand merged pull request #442:
URL: https://github.com/apache/lucene/pull/442


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #442:
URL: https://github.com/apache/lucene/pull/442#discussion_r750653939



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {

Review comment:
       We can do this because the taxo index ensures that new ordinals are appended into new segments at the end of the index, on each refresh?
   
   E.g. it's using `LogDocMergePolicy` or `LogByteSizeMergePolicy`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);

Review comment:
       Maybe also `null`-check `parentValues` and throw `CorruptIndexException` if so?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {

Review comment:
       Could we maybe make this more explicit?  Check the created version from `SegmentInfos`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -130,15 +125,49 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }
 
+    if (tryLoadParentUsingTermPosition(reader, first)) {
+      // The parent array is already loaded
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + doc + leafContext.docBase, reader.toString());

Review comment:
       Hmm maybe put parens around `(doc + leafContext.docBase)`?  I'm not sure about the order-of-concatenation-or-addition here.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -161,12 +165,14 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW
       indexEpoch = 1;
       // no commit exists so we can safely use the new BinaryDocValues field
       useOlderStoredFieldIndex = false;
+      useOlderTermPositionForParentOrdinal = false;
     } else {
       String epochStr = null;
 
       SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
       /* a previous commit exists, so check the version of the last commit */
       useOlderStoredFieldIndex = infos.getIndexCreatedVersionMajor() <= 8;
+      useOlderTermPositionForParentOrdinal = infos.getIndexCreatedVersionMajor() <= 8;

Review comment:
       So this means, even if you are running Lucene 9.x, if you open an older (created in Lucene 8.x) index, you will continue to use the "old way" (hacky custom positions).  I like that approach (versus going segment by segment which gets trickier).
   
   One must create a new 9.x index to get the new DV encoding.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org