You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/21 08:38:53 UTC

[GitHub] [lucene] gautamworah96 opened a new pull request #220: LUCENE-9450: Use BinaryDocValue fields with a different name in the taxonomy index

gautamworah96 opened a new pull request #220:
URL: https://github.com/apache/lucene/pull/220


   Category documents added in the Lucene 9.0 taxonomy index use a BDV field with a different name
   
   Using BDV fields with a different "$full_path_binary$" name
   ensures that the earlier "$full_path$" StringField does not have the same name as the
   BDV field and hence they don't violate the field type consistency check
   (LUCENE-9334).
   
   This commit also enables the back-compat check that was disabled
   earlier.
   
   https://issues.apache.org/jira/browse/LUCENE-9450
   
   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   # Solution
   
   There were two proposed solutions in the JIRA ticket:
   
   1. Add the BDV field with a different name. 
   When we were adding the BDV field with the same `Consts.FULL` name, it was causing a `    java.lang.IllegalArgumentException: cannot change field "$full_path$" from doc values type=NONE to inconsistent doc values type=BINARY
   ` error because the current logic checks all fields with the same name across segments and ensures that they use the same BinaryDocValues field TYPE. 
   
   Adding the BDV field with a different name ensures that the [check](https://github.com/apache/lucene/blob/28ba8b77975fc1b5a1a07da373916a2b21ea09aa/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L260) does not trip. We are careful here to use the same new name when trying to retrieve `values` in the `DirectoryTaxonomyReader`
   
   2. Perform a check on the index version when we try to add a BDV field. If the index is pre 9.0 we only add the StringField and use only that field when trying to read the value from the index. If the index is newer (>=9.0), we add and read the value from a BDV field.
   
   This PR implements the approach described in step 1.
   
   # Tests
   
   Enabled the back-compat test in `TestBackwardsCompatibility.testCreateNewTaxonomy`
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677669790



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
       So when we first initialize the `fieldPath` variable it is [initialized](https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java#L185) with `Field.Store.NO`. Then when we realize that the field has to be stored as a StringField, we need its value as well, so I reassign it with a `new StringField(Consts.FULL, fieldPath, Field.Store.YES)` statement.
   
   > Maybe just move the above line inside the if?
   
   Sure
   
   > why don't we also re-use the BinaryDocValues field
   
   Hmmm. Could you elaborate a bit? The field depends on the input `FacetLabel categoryPath`
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand closed pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
mikemccand closed pull request #220:
URL: https://github.com/apache/lucene/pull/220


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677406777



##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
##########
@@ -49,13 +49,8 @@
   //
   // Then move the zip file to your trunk checkout and use it in your test cases
 
-  public static final String oldTaxonomyIndexName = "taxonomy.8.6.3-cfs";
+  public static final String oldTaxonomyIndexName = "taxonomy.8.10.0-cfs";
 
-  // LUCENE-9334 requires consistency of field data structures between documents.
-  // Old taxonomy index had $full_path$ field indexed only with postings,
-  // It is not allowed to add the same field $full_path$ indexed with BinaryDocValues
-  // for a new segment, that this test is trying to do.
-  @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9334")

Review comment:
       Woot!

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
       Hmm this is odd -- up above this `if` we are already calling `setStringValue` on the reused `fullPathField`, but then inside this `if` we override `fullPathField` with a new `StringField` instance?
   
   Maybe just move the above line inside the `if`?
   
   Also, why don't we also re-use the `BinaryDocValues` field?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW
     openMode = config.getOpenMode();
     if (!DirectoryReader.indexExists(directory)) {

Review comment:
       Maybe take the opportunity to repair this horrible `!` by switching it to `== false`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r678493344



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
       Hmmm. I guess I did not find it confusing but it is a bit strange for sure. The next commit initializes it upfront




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields with a different name in the taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on pull request #220:
URL: https://github.com/apache/lucene/pull/220#issuecomment-886968851


   Changes in the new b9cbc4c commit:
   
   1. The reason why the `SegmentInfos.readLatestCommit(dir).getMinSegmentLuceneVersion()` call was returning 9 as the version, was that the older zip file in the mainline was using the Lucene 8.6 Codec but the major version variable was still assigned as 9. This was because the `main` branch in the repo (during the 8.6 release) had already set the major version as 9. I reconstructed the 8.10 taxonomy index from the `branch_8x` branch and that correctly set the major version as 8 for those older segments.
   2. Use a version based check for storing BDV fields or StringFields 
   
   I think the new commit might be slower that the previous `$full_path_binary$` option during indexing because it checks the Lucene version of the last commit everytime we add a new category.
    
   Finally, I think there should be a cleaner way of knowing if the index has atleast one commit or no. I use the `indexWriter.getLiveCommitData().iterator().hasNext()` call but maybe there is a better way..
   
   Side questions that need more thought:
   1. What is the use of the `LiveIndexWriterConfig.createdVersionMajor` param. I think instead of initializing it to the latest version, maybe we can assign the value of the min back compat version of the index to it (when the `LiveIndexWriterConfig` class is initialized).
   2. Can we fix the `DirectoryTaxonomyWriter.indexEpoch` variable to hold the accurate index epoch of the taxonomy index. 
   The current logic for `indexEpoch` assigns 1 even if the index is completely fresh. It also saves 1 as the value when the index has just 1 commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields with a different name in the taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r676902852



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +476,20 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    boolean commitExists = indexWriter.getLiveCommitData().iterator().hasNext();
+    /* no commits so this is a fresh index, or the old index was built using a Lucene 9 or greater version */
+    if ((commitExists == false)
+        || (SegmentInfos.readLatestCommit(dir)

Review comment:
       This is a horrifyingly costly check to do for every added `FacetLabel`!  Couldn't we do this check once in ctor when this `TaxonomyWriter` is created, and store the result in a `final boolean`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +476,20 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    boolean commitExists = indexWriter.getLiveCommitData().iterator().hasNext();
+    /* no commits so this is a fresh index, or the old index was built using a Lucene 9 or greater version */
+    if ((commitExists == false)
+        || (SegmentInfos.readLatestCommit(dir)
+            .getMinSegmentLuceneVersion()
+            .onOrAfter(Version.LUCENE_9_0_0))) {
+      /* Lucene 9 introduces BinaryDocValuesField for storing taxonomy categories */

Review comment:
       Maybe `switches to` instead of `introduces`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #220:
URL: https://github.com/apache/lucene/pull/220#issuecomment-887472869


   Also, to be clear, even though the opening comment says the PR implemented option 1, it has now iterated onto option 2 (switching based on the index created version metadata).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677908217



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
       OK it's just super confusing that we act like we will reuse it (by creating it in ctor) but then sometimes overwrite it and other times re-use it.
   
   Couldn't we init it with the right `Field.Store.YES|NO` option up front, since we know in ctor which way it'll go, and then just always re-use it?
   
   And let's scratch my suggestion to also reuse the `BDV` field -- let's just make a new one each time like you are doing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #220:
URL: https://github.com/apache/lucene/pull/220#issuecomment-888288068


   > What is the use of the LiveIndexWriterConfig.createdVersionMajor
   
   It's very expert. It's necessary if you have multiple workers creating indices that you then want to merge together using `IndexWriter#addIndexes`. `addIndexes` requires that all indices have the same major version, so if you are doing a rolling upgrade on your workers to a new Lucene major, this helps ensure that all indices are created in a way that they can be merged eventually.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677794298



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##########
@@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW
     openMode = config.getOpenMode();
     if (!DirectoryReader.indexExists(directory)) {

Review comment:
       Sure




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #220:
URL: https://github.com/apache/lucene/pull/220#issuecomment-889320876


   OK I just merged this via git command-line, but apparently GitHub hasn't noticed.  Thanks @gautamworah96 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org