You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/01 19:47:06 UTC

[GitHub] [lucene] zhaih opened a new pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

zhaih opened a new pull request #420:
URL: https://github.com/apache/lucene/pull/420


   
   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   As mentioned in the issue, use NumericDocValues to store parent array instead of term positioning.
   
   # TODO
    * benchmark
    * address the backward compatibility issue
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`. (Not yet, need to deal with backward compatibility issue)
   - [ ] I have added tests for my changes. (Old tests should be sufficient)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r741939064



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       Maybe there are benefits of MultiDocValues I'm missing for this specific use-case, but in general we prefer consuming data-structures segment-by-segment whenever possible and only rely on the MultiXXX classes for merging.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r741472877



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       I know this is just for benchmarking right now, but for what it's worth, I think you can probably simplify here and rely on `MultiDocValues#getNumericValues` if you move forward with this work.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r741939064



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       Maybe there are benefits of MultiDocValues I'm missing for this specific use-case, but in general we prefer consuming data-structures segment-by-segment whenever possible and only rely on the MultiXXX classes for merging.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r743030399



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues = leafContext.reader().getNumericDocValues(Consts.FIELD_PAYLOADS);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException("Missing parent data for category " + doc + leafContext.docBase, reader.toString());
         }
-      } else { // this shouldn't happen
-        throw new CorruptIndexException("Missing parent data for category " + i, reader.toString());
+        // we're putting a int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = (int) parentValues.longValue();

Review comment:
       Maybe use `Math.toIntExact` for paranoia?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues = leafContext.reader().getNumericDocValues(Consts.FIELD_PAYLOADS);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException("Missing parent data for category " + doc + leafContext.docBase, reader.toString());
         }
-      } else { // this shouldn't happen
-        throw new CorruptIndexException("Missing parent data for category " + i, reader.toString());
+        // we're putting a int and converting it back so it should be safe

Review comment:
       s/`a int`/`an int`

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -466,18 +463,8 @@ protected final void ensureOpen() {
    * effectively synchronized as well.
    */
   private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOException {
-    // Before Lucene 2.9, position increments >=0 were supported, so we
-    // added 1 to parent to allow the parent -1 (the parent of the root).
-    // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is

Review comment:
       Phew, I'm so happy we can remove this comment!  That issue was hellacious back compat nightmare.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r743098739



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       The only benefit I had in mind was some code simplicity since `MultiDocValues` handles iterating through all the leaves for us, which is what we're doing here. But maybe there's some overhead or other downsides to using `MultiDocValues`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       The only benefit I had in mind was some code simplicity since `MultiDocValues` handles iterating through all the leaves for us, which is what we're doing here. But maybe there's some overhead or other downsides to using `MultiDocValues`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih commented on pull request #420:
URL: https://github.com/apache/lucene/pull/420#issuecomment-966534774


   I'm going to close this PR given not seeing performance gain but a 5% inflated taxonomy index, details can be found in the JIRA issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r741939064



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       Maybe there are benefits of MultiDocValues I'm missing for this specific use-case, but in general we prefer consuming data-structures segment-by-segment whenever possible and only rely on the MultiXXX classes for merging.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       Maybe there are benefits of MultiDocValues I'm missing for this specific use-case, but in general we prefer consuming data-structures segment-by-segment whenever possible and only rely on the MultiXXX classes for merging.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih closed pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
zhaih closed pull request #420:
URL: https://github.com/apache/lucene/pull/420


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r743030399



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues = leafContext.reader().getNumericDocValues(Consts.FIELD_PAYLOADS);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException("Missing parent data for category " + doc + leafContext.docBase, reader.toString());
         }
-      } else { // this shouldn't happen
-        throw new CorruptIndexException("Missing parent data for category " + i, reader.toString());
+        // we're putting a int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = (int) parentValues.longValue();

Review comment:
       Maybe use `Math.toIntExact` for paranoia?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues = leafContext.reader().getNumericDocValues(Consts.FIELD_PAYLOADS);
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException("Missing parent data for category " + doc + leafContext.docBase, reader.toString());
         }
-      } else { // this shouldn't happen
-        throw new CorruptIndexException("Missing parent data for category " + i, reader.toString());
+        // we're putting a int and converting it back so it should be safe

Review comment:
       s/`a int`/`an int`

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -466,18 +463,8 @@ protected final void ensureOpen() {
    * effectively synchronized as well.
    */
   private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOException {
-    // Before Lucene 2.9, position increments >=0 were supported, so we
-    // added 1 to parent to allow the parent -1 (the parent of the root).
-    // Unfortunately, starting with Lucene 2.9, after LUCENE-1542, this is

Review comment:
       Phew, I'm so happy we can remove this comment!  That issue was hellacious back compat nightmare.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #420:
URL: https://github.com/apache/lucene/pull/420#discussion_r743098739



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##########
@@ -129,39 +124,19 @@ private void initParents(IndexReader reader, int first) throws IOException {
     if (reader.maxDoc() == first) {
       return;
     }
-
-    // it's ok to use MultiTerms because we only iterate on one posting list.
-    // breaking it to loop over the leaves() only complicates code for no
-    // apparent gain.
-    PostingsEnum positions =
-        MultiTerms.getTermPostingsEnum(
-            reader, Consts.FIELD_PAYLOADS, Consts.PAYLOAD_PARENT_BYTES_REF, PostingsEnum.PAYLOADS);
-
-    // shouldn't really happen, if it does, something's wrong
-    if (positions == null || positions.advance(first) == DocIdSetIterator.NO_MORE_DOCS) {
-      throw new CorruptIndexException(
-          "Missing parent data for category " + first, reader.toString());
-    }
-
-    int num = reader.maxDoc();
-    for (int i = first; i < num; i++) {
-      if (positions.docID() == i) {
-        if (positions.freq() == 0) { // shouldn't happen
-          throw new CorruptIndexException(
-              "Missing parent data for category " + i, reader.toString());
-        }
-
-        parents[i] = positions.nextPosition();
-
-        if (positions.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-          if (i + 1 < num) {
-            throw new CorruptIndexException(
-                "Missing parent data for category " + (i + 1), reader.toString());
-          }
-          break;
+    for (LeafReaderContext leafContext: reader.leaves()) {

Review comment:
       The only benefit I had in mind was some code simplicity since `MultiDocValues` handles iterating through all the leaves for us, which is what we're doing here. But maybe there's some overhead or other downsides to using `MultiDocValues`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org