You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/08/10 21:40:23 UTC

[GitHub] [lucene-solr] gautamworah96 opened a new pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

gautamworah96 opened a new pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:
   
   * https://issues.apache.org/jira/projects/LUCENE
   * https://issues.apache.org/jira/projects/SOLR
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   * SOLR-####: <short description of problem or changes>
   
   LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   This PR modifies the taxonomy writer and reader implementation to use BinaryDocValues instead of StoredValues. 
   The taxonomy index uses stored fields today and must do a number of stored field lookups for each query to resolve taxonomy ordinals back to human presentable facet labels.
   
   # Solution
   
   Change the storage format to use DocValues
   
   # Tests
   
   ant test fails because
   `.binaryValue()` returns a `NullPointerException`
   
   To reproduce the error:
   `ant test  -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII`
   
   gives
   
   ```nit4:pickseed] Seed property 'tests.seed' already defined: 4544BD51622879A4
       [mkdir] Created dir: /Users/gauworah/opensource/mystuff/lucene-solr/lucene/build/demo/test/temp
      [junit4] <JUnit4> says Привет! Master seed: 4544BD51622879A4
      [junit4] Executing 1 suite with 1 JVM.
      [junit4] 
      [junit4] Started J0 PID(76859@localhost).
      [junit4] Suite: org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample
      [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
      [junit4] ERROR   0.61s | TestExpressionAggregationFacetsExample.testSimple <<<
      [junit4]    > Throwable #1: java.lang.NullPointerException
      [junit4]    >        at __randomizedtesting.SeedInfo.seed([4544BD51622879A4:7DF799AF45DBAD75]:0)
      [junit4]    >        at org.apache.lucene.index.MultiDocValues$3.binaryValue(MultiDocValues.java:403)
      [junit4]    >        at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.getPath(DirectoryTaxonomyReader.java:328)
      [junit4]    >        at org.apache.lucene.facet.taxonomy.FloatTaxonomyFacets.getTopChildren(FloatTaxonomyFacets.java:151)
      [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.search(ExpressionAggregationFacetsExample.java:107)
      [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.runSearch(ExpressionAggregationFacetsExample.java:118)
      [junit4]    >        at org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample.testSimple(TestExpressionAggregationFacetsExample.java:28)
      [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
      [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:830)
   ```
   
   3 other tests also fail at the same line
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   
   **This is a draft PR**


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] ErickErickson commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

ErickErickson commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696828850


   
   
   > On Sep 22, 2020, at 11:04 AM, Michael McCandless <no...@github.com> wrote:
   > 
   > 
   > So I propose we get rid of the fullPathField altogether.
   > 
   > Wow, +1, this looks like it is (pre-existingly?) double-indexed? Maybe we should do this as a separate pre-cursor PR to this one (switch to StoredField when indexing the fullPathField)?
   > 
   > For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.
   > 
   > Yeah +1 to, on a hit by hit basis, try BinaryDocValues first, and then fallback to the StoredField. This is the cost of backwards compatibility ... though, for a fully new (all BinaryDocValues) index, the performance should be fine. Also, note that in Lucene 10.x we can remove that back-compat fallback.
   > 
   > Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.
   > 
   > I think this would indeed be tricky.
   
   Andrzej and I spent quite a bit of time trying to get something similar to work for adding docValues on the fly using a custom merge policy. We realized that you could create a docValues field from an indexed field for primitive types since all the information was already in the index. We never could get it working if there was active indexing happening, so resorted to a batch process that rewrote all segments doing the transformation along the way that had to be run on a quiescent index, the client decided that was good enough and didn’t want to spend more time on it.
   
   Our best guess was that there was a race condition that we somehow couldn’t find in the time allowed… Mostly just FYI...
   
   FWIW,
   Erick
   
   > 
   > —
   > You are receiving this because you are subscribed to this thread.
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   > 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r474356325



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -322,9 +324,14 @@ public FacetLabel getPath(int ordinal) throws IOException {
         return res;
       }
     }
-    
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+
+    int readerIndex = ReaderUtil.subIndex(ordinal, indexReader.leaves());
+    LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+    BinaryDocValues values = leafReader.getBinaryDocValues(Consts.FULL);

Review comment:
       Could you give an example of what that bulk lookup function params would look like / some reference in the code where something of this sort is implemented? 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696781342


   > So I propose we get rid of the fullPathField altogether.
   
   Wow, +1, this looks like it is (pre-existingly?) double-indexed?  Maybe we should do this as a separate pre-cursor PR to this one (switch to `StoredField` when indexing the `fullPathField`)?
   
   
   > For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.
   
   Yeah +1 to, on a hit by hit basis, try `BinaryDocValues` first, and then fallback to the `StoredField`.  This is the cost of backwards compatibility ... though, for a fully new (all `BinaryDocValues`) index, the performance should be fine.  Also, note that in Lucene 10.x we can remove that back-compat fallback.
   
   > Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.
   
   I think this would indeed be tricky.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-673498515


   Woohoo, tests all pass now?  What a tiny change it turned out to be :)
   
   Can you try to run `luceneutil` benchmarks?  Let's see if this is net/net faster.  Even if it is the same speed, we should move forward -- stored fields are likely to get more compressed / slower to access over time, e.g. https://issues.apache.org/jira/browse/LUCENE-9447.
   
   We can also (separate followon issue!) better optimized the `ord -> FacetLabel` lookup to do them in bulk, in order, so we can share a single `BinaryDocValues` instance per leaf per query.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696781342


   > So I propose we get rid of the fullPathField altogether.
   
   Wow, +1, this looks like it is (pre-existingly?) double-indexed?  Maybe we should do this as a separate pre-cursor PR to this one (switch to `StoredField` when indexing the `fullPathField`)?
   
   
   > For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.
   
   Yeah +1 to, on a hit by hit basis, try `BinaryDocValues` first, and then fallback to the `StoredField`.  This is the cost of backwards compatibility ... though, for a fully new (all `BinaryDocValues`) index, the performance should be fine.  Also, note that in Lucene 10.x we can remove that back-compat fallback.
   
   > Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.
   
   I think this would indeed be tricky.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r470204070



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -322,9 +324,14 @@ public FacetLabel getPath(int ordinal) throws IOException {
         return res;
       }
     }

Review comment:
       Opened [LUCENE-9460](https://issues.apache.org/jira/browse/LUCENE-9460?filter=-2) for this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r468812527



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -323,8 +323,10 @@ public FacetLabel getPath(int ordinal) throws IOException {
       }
     }
     
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+    boolean found = MultiDocValues.getBinaryValues(indexReader, Consts.FULL).advanceExact(catIDInteger);

Review comment:
       One more idea: instead of using `MultiDocValues` sugar, I think we should use Lucene's `ReaderUtil` to quickly (binary search) determine which leaf holds this `docId`, then pull `BinaryDocValues` from that `LeafReader`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r469727743



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -323,8 +323,10 @@ public FacetLabel getPath(int ordinal) throws IOException {
       }
     }
     
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+    boolean found = MultiDocValues.getBinaryValues(indexReader, Consts.FULL).advanceExact(catIDInteger);

Review comment:
       Thank you for looking at it so closely (and helping in debugging).
   
   The new PR has the following changes:
   1. Use `ordinal`  instead of `catIDInteger` (IntelliJ says that boxing is anyways not needed, perhaps we can remove?)
   2. Use the correct `values` instance that has advanced to the correct `docId`
   3. Use `ReaderUtil` to get to the `leaf` and then use `LeafReader` instead of using the higher level `MultiDocValues` call
   
   TEST:
   `ant test` in the `lucene-solr/lucene/facets` directory (passes successfully)
   `ant precommit`
   
   I've not added any new tests because this PR changes a low level implementation detail and the current tests already cover this
   
   The next step is to run lucene benchmarks!
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r469977297



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -322,9 +324,14 @@ public FacetLabel getPath(int ordinal) throws IOException {
         return res;
       }
     }

Review comment:
       Pre-existing: I don't like that we `return null` up above if the requested `ordinal` is out-of-bounds.  That's dangerous leniency and likely means the user is refreshing their main `IndexReader` and the `TaxonomyReader` in the wrong order.  It would be better to throw an exception here?  @gautamworah96 could you open a follow-on issue to fix that?  Thanks.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -494,6 +496,7 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     fullPathField.setStringValue(FacetsConfig.pathToString(categoryPath.components, categoryPath.length));
     d.add(fullPathField);
+    d.add(new BinaryDocValuesField(Consts.FULL, new BytesRef(FacetsConfig.pathToString(categoryPath.components, categoryPath.length))));

Review comment:
       Could you factor out the `FacetsConfig.pathToString(...)` part in a new local variable and re-use that?  We use it in (at least?) two places here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand merged pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand merged pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] goankur commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

goankur commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696420384

Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
A few thoughts

1. This change disables STORED fields part but keeps the POSTINGS part here
`fullPathField = new StringField(Consts.FULL, "", Field.Store.NO); ` which is unnecessary as postings are already enabled for facet labels in [FacetsConfig#L364-L399](https://github.com/apache/lucene-solr/blob/master/lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java#L364-L366) including [dimension drill-down](https://github.com/apache/lucene-solr/blob/master/lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java#L88). So I propose we get rid of the `fullPathField` altogether.

2. For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable. Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-723694318


   The earlier revision of this PR had backward compatibility test failures after merging because the Lucene codec has changed.
   I've added a `testCompile` dependency on Lucene's 8.6.3 backward codecs.
   
   There are minor changes in the `versions.lock` file to address the `found dependencies that were not in the lock state` error.
   This file was autogenerated by running `./gradlew --write-locks`.
   
   The current merge failure is due to this `versions.lock` file change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r474059462



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -322,9 +324,14 @@ public FacetLabel getPath(int ordinal) throws IOException {
         return res;
       }
     }
-    
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+
+    int readerIndex = ReaderUtil.subIndex(ordinal, indexReader.leaves());
+    LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+    BinaryDocValues values = leafReader.getBinaryDocValues(Consts.FULL);

Review comment:
       Could you open a new issue to optimize this better in the future, to do bulk lookup of `ordinal -> FacetLabel`?  And then add a `// TODO ` pointing to that issue and describing the optimization?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] goankur commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

goankur commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696420384

Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
A few thoughts

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-679097363

I think the back-compat layer at read-time is a good start, but is not quite enough.

Imagine Susan. She upgrades to Lucene with this change, pushed, as it is now. Susan runs some queries against her index, resolves ordinals, using the stored fields, and all is good. Susan indexes some more documents with facet labels, and these new segments in the taxonomy index are written using `BinaryDocValues`. Susan refreshes and runs some queries, and some ordinals resolve using old way (if they came from old segments), and some the new way (if they came from new segments). Life goes on, more documents indexed. Suddenly, Susan's taxonomy index executes a merge! Merging old and new segments together, now the newly merged segment has a mix of some docs that used stored fields while others used `BinaryDocValues`, and the back compat logic will becomes confused and incorrectly try to use `BinaryDocValues` instead of stored fields, and I think that `assert` will trip for such documents?

Could you try to add a test case showing this case? Have a look at Lucene's `TestBackwardsCompatibility` -- it tests the main index, but you can borrow the ideas (e.g. APIs to zip/unzip) to implement a new unit test confirming we are maintaining back-compat for taxonomy index?

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r468808904



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -323,8 +323,10 @@ public FacetLabel getPath(int ordinal) throws IOException {
       }
     }
     
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+    boolean found = MultiDocValues.getBinaryValues(indexReader, Consts.FULL).advanceExact(catIDInteger);

Review comment:
       OK, I see one issue -- you are pulling a new `BinaryDocValues`, calling `.advanceExact` on it (good), but then pulling a new `BinaryDocValues` below and not calling `.advanceExact` on it.
   
   I think you must add a new local variable, e.g. `BinaryDocValues values`.  Pull it once (using the `MultiDocValues.getBinaryValues` sugar API).  Then call `.advanceExact` on that and assert it succeeded. Finally, use that same `values` instance (now that it has advanced to the right `docId`) to call `.binaryValue().utf8ToString()`.
   
   I think that should fix the `NPE`?
   
   This is misuse of the API for the default Lucene Codec for `BinaryDocValues`, since you were calling `.binaryValue()` before `.advanceExact()`.  It is somewhat disappointing that the codec threw a confusing `NPE` and not a clearer (best effort) exception stating that you must first call `.advanceExact`.  Maybe we could improve the default Codec?  (Though, not if that would hurt performance of correct usage).  OK I see: the `NPE` is because of `MultiDocValues.currentValues` is `null` since `.advanceExact` was not yet called.  Maybe we could add an `assert` there, confirming `.advanceExact` was indeed called and had returned `true`?  It would have made debugging this easier, and should not hurt performance when assertions are disabled ...

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -323,8 +323,10 @@ public FacetLabel getPath(int ordinal) throws IOException {
       }
     }
     
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+    boolean found = MultiDocValues.getBinaryValues(indexReader, Consts.FULL).advanceExact(catIDInteger);

Review comment:
       I think instead of the boxed `Integer catIDInteger` we should pass the `int ordinal` to `.advanceExact(...)`?  Not the cause of the `NPE`, just cleaner.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r470160799



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -494,6 +496,7 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     fullPathField.setStringValue(FacetsConfig.pathToString(categoryPath.components, categoryPath.length));
     d.add(fullPathField);
+    d.add(new BinaryDocValuesField(Consts.FULL, new BytesRef(FacetsConfig.pathToString(categoryPath.components, categoryPath.length))));

Review comment:
       Yep, this could be cleaner. Extracted out in the next commit




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-677728465

This change looks good to me!

I think the biggest issue is what to do about backwards compatibility? Users who upgrade to this release will suddenly see their taxonomy index becomes unreadable.

We could 1) make this a Lucene 9.x only change. Normally, for Lucene's main index, the next major release should be able to read all stable releases from the previous major release. But for the taxonomy index, I suspect it is OK if we relax that and make a hard break to the index. There are very few (but, non-zero) users of Lucene's faceting.

Or 2) we add a basic backwards compatibility support here, and then we can push this to 8.x stable releases. E.g. if we could differentiate when we are opening an already created (based on stored fields) taxonomy index, use the old way, but if we are making a new taxonomy index, use the new way. This would be pretty simple to build, I suspect. E.g. on opening the index, we could try to pull `BinaryDocValues` and if it exists, we know it's the new way, else, use the old way, unless the index is empty, in which case, use the new way?

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] ErickErickson commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

ErickErickson commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696828850


   
   
   > On Sep 22, 2020, at 11:04 AM, Michael McCandless <no...@github.com> wrote:
   > 
   > 
   > So I propose we get rid of the fullPathField altogether.
   > 
   > Wow, +1, this looks like it is (pre-existingly?) double-indexed? Maybe we should do this as a separate pre-cursor PR to this one (switch to StoredField when indexing the fullPathField)?
   > 
   > For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.
   > 
   > Yeah +1 to, on a hit by hit basis, try BinaryDocValues first, and then fallback to the StoredField. This is the cost of backwards compatibility ... though, for a fully new (all BinaryDocValues) index, the performance should be fine. Also, note that in Lucene 10.x we can remove that back-compat fallback.
   > 
   > Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.
   > 
   > I think this would indeed be tricky.
   
   Andrzej and I spent quite a bit of time trying to get something similar to work for adding docValues on the fly using a custom merge policy. We realized that you could create a docValues field from an indexed field for primitive types since all the information was already in the index. We never could get it working if there was active indexing happening, so resorted to a batch process that rewrote all segments doing the transformation along the way that had to be run on a quiescent index, the client decided that was good enough and didn’t want to spend more time on it.
   
   Our best guess was that there was a race condition that we somehow couldn’t find in the time allowed… Mostly just FYI...
   
   FWIW,
   Erick
   
   > 
   > —
   > You are receiving this because you are subscribed to this thread.
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   > 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on a change in pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#discussion_r475300614



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -322,9 +324,14 @@ public FacetLabel getPath(int ordinal) throws IOException {
         return res;
       }
     }
-    
-    Document doc = indexReader.document(ordinal);
-    FacetLabel ret = new FacetLabel(FacetsConfig.stringToPath(doc.get(Consts.FULL)));
+
+    int readerIndex = ReaderUtil.subIndex(ordinal, indexReader.leaves());
+    LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+    BinaryDocValues values = leafReader.getBinaryDocValues(Consts.FULL);

Review comment:
       I created a new issue [LUCENE 9476](https://issues.apache.org/jira/browse/LUCENE-9476#)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] gautamworah96 commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

gautamworah96 commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-672021513


   Changes in this revision (incorporated from feedback on JIRA):
   
   * Added a call to `advanceExact()` before calling `.binaryValue()` and an `assert` to check that the field exists in the index
   * Re-added the `StringField` with the `Field.Store.YES` changed to `Field.Store.NO`.
   
   * I've not added new tests at the moment. Trying to get the existing ones to work first.
   
   From the error log:
   Note that the code is able to successfully execute the `assert found` statement (so the field does exist), and it fails on the next line
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] goankur commented on pull request #1733: LUCENE-9450 Use BinaryDocValues in the taxonomy writer

Posted by GitBox <gi...@apache.org>.

goankur commented on pull request #1733:
URL: https://github.com/apache/lucene-solr/pull/1733#issuecomment-696420384

Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
A few thoughts

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org