You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/26 04:44:59 UTC

[GitHub] [lucene-solr] gautamworah96 opened a new pull request #2247: WIP: LUCENE-9476 Add basic functionality, basic tests

gautamworah96 opened a new pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:
   
   * https://issues.apache.org/jira/projects/LUCENE
   * https://issues.apache.org/jira/projects/SOLR
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   * SOLR-####: <short description of problem or changes>
   
   LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   In [LUCENE-9450](https://issues.apache.org/jira/browse/LUCENE-9450) we switched the Taxonomy index from Stored Fields to `BinaryDocValues.` In the resulting implementation of the `getPath` code, we create a new `BinaryDocValues`'s values instance for each ordinal. 
   It may happen that we may traverse over the same nodes over and over again if the `getPath` API is called multiple times for ordinals in the same segment/with the same `readerIndex`.
   
   This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same `readerIndex` (by trying to `advanceExact` to the correct position and not failing) thereby allowing us to reuse the previous `BinaryDocValues` object. 
   
   
   # Solution
   
   Steps:
   1. Sort all ordinals and remember their position so as to store the path in the correct position
   2. Try to `advanceExact` to the correct position with the previously calculated `readerIndex`. If the operation fails, try to find the correct segment for the ordinal and then `advanceExact` to the desired position.
   3. Store this position for future ordinals.
   
   # Tests
   
   Added a new test for the API that compares the individual `getPath` results from ordinals with the bulk FacetLabels returned by the `getBulkPath` API
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r578019284



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+      }
+    }
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      synchronized (categoryCache) {
+        categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);

Review comment:
       > This should also (theoretically) be faster than trying to get the lock again and again in a loop?
   
   Hmm, I'm confused: this code is already getting the lock inside a `for` loop?  I guess we could move the `synchronized` outside of the `for` loop?  Or, maybe `javac` is doing this for us already?  But let's make it explicit, or, let's just merge this `for` loop with the one before (and keep acquiring the lock inside the `for` loop)?  One big benefit of the latter approach is that if all of the ordinals were already cached (hopefully typically a common case), we do not need any locking, but with this approach, we still do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r564970251



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {

Review comment:
       I was also confused as to why the method above this one is named `getPath` and not `getLabel` or `getFacetLabel`. I chose `getBulkPath` to maintain consistency.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r567085518



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,

Review comment:
       Oooh that is a great idea, and low-hanging fruit, and would greatly reduce the RAM usage for this cache.
   
   I think `DirectoryTaxonomyWriter` also has such a cache that we could change to a native map.
   
   Could you open a spinoff issue?

##########
File path: lucene/CHANGES.txt
##########
@@ -11,6 +11,8 @@ New Features
 
 * LUCENE-9004: Approximate nearest vector search via NSW graphs
 
+* LUCENE-9476: DirectoryTaxonomyReader now provides a getBulkPath API (Gautam Worah)

Review comment:
       Can you expand a bit more?  E.g.:
   
   ```
   LUCENE-9476: add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple facet ordinals at once
   ```

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);
+    }
+
+    Arrays.sort(ordinals);
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext = null;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinals) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null || ord > leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ord - leafReaderDocBase);
+          if (success == false) {
+            throw new IOException(
+                "the taxonomy index is created using the older StoredFields format which uses a Lucene "
+                    + "codec older than 8.7. Use the getPath(int ordinal) API iteratively instead.");
+          }
+        }
+        values.advanceExact(ord - leafReaderDocBase);

Review comment:
       Hmm can we `assert` that this indeed returned `true`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)

Review comment:
       Maybe rename to `checkOrdinalBounds`?   The `is` name prefix makes me think it returns a `boolean`.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);
+    }
+
+    Arrays.sort(ordinals);
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext = null;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinals) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null || ord > leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ord - leafReaderDocBase);
+          if (success == false) {
+            throw new IOException(

Review comment:
       Hmm, instead, could we just forward to the slow `getPath` API for this case?  And then update the javadocs to state that if you call this on a pre-8.7 written taxonomy index, it will "emulate" bulk by calling the single lookup API multiple times, i.e. no performance gains, but on newer indices, yes performance gains?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of

Review comment:
       Hmm does `{@link #getPath}` not work?  Does that somehow fail `gradle precommit` or something?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -320,18 +322,12 @@ public FacetLabel getPath(int ordinal) throws IOException {
     // doOpenIfChanged, we need to ensure that the ordinal is one that this DTR
     // instance recognizes. Therefore we do this check up front, before we hit
     // the cache.
-    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
-      return null;
-    }
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    isOrdinalInIndexReaderRange(ordinal, indexReaderMaxDoc);

Review comment:
       Ahh, I see, this is a (small) API break for `getPath`, throwing exception instead of leniently returning `null`, but I think it is the right change since it serves to expose bugs in the Lucene user's usage of facets APIs.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {

Review comment:
       Ahhh, I see!
   
   But then, shouldn't we just not enroll that ordinal/index into the map, up front?

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,32 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt()));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);

Review comment:
       Could we randomly `w.commit()` in here with low but non-zero chance?  This way we will sometimes exercise a multi-segment taxo index.
   
   And, could you temporarily put the `assert` back from the first revision and confirm this test case eventually would catch that bug?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);
+    }
+
+    Arrays.sort(ordinals);
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext = null;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinals) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null || ord > leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ord - leafReaderDocBase);
+          if (success == false) {
+            throw new IOException(
+                "the taxonomy index is created using the older StoredFields format which uses a Lucene "
+                    + "codec older than 8.7. Use the getPath(int ordinal) API iteratively instead.");
+          }
+        }
+        values.advanceExact(ord - leafReaderDocBase);
+        bulkPath[originalPosition.get(ord)] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+        synchronized (categoryCache) {

Review comment:
       We could move this `synchrtonized` out to a separate pass in the end, so we don't sync per ordinal but rather sync once, and then add all ordinals we looked up?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);

Review comment:
       Hmm, one reason for doing the parallel sort instead of the hash map is: what if the incoming ordinals have duplicates?  It is silly to do, but user might do it out of laziness or something, and as the code now stands, that would not work right, I think?  I.e. one of the returned `FacetLabel`s would be `null`?
   
   Could you add a test case for that (admittedly, strange) usage?
   
   And then I think switch to the parallel sort?  [Here is an example](https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L423-L476) of how to do this using Lucene's `Sorter` utility APIs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566977877



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {

Review comment:
       It will not be `null` if we already find it in the cache in the initial step i.e the `FacetLabel` for this `ordinal` was found in the cache itself.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572447037



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);
+    }
+
+    Arrays.sort(ordinals);
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext = null;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinals) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null || ord > leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ord - leafReaderDocBase);
+          if (success == false) {
+            throw new IOException(

Review comment:
       Done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566976943



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();

Review comment:
       This seems much better. Updated. 
   My initial impression was that `ensureOpen()` needs to be called before looking up each ordinal.

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,31 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testBulkPath() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String arr[] = new String[] {"a", "b", "c", "d", "e"};
+
+    FacetLabel allpaths[] = new FacetLabel[arr.length];
+    int allords[] = new int[arr.length];
+
+    for (int i = 0; i < arr.length; i++) {
+      allpaths[i] = new FacetLabel(arr[i]);
+      w.addCategory(allpaths[i]);
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allpaths.length; i++) {
+      allords[i] = r1.getOrdinal(allpaths[i]);
+    }
+
+    FacetLabel allBulkpath[] = r1.getBulkPath(allords);
+    assertArrayEquals(allpaths, allBulkpath);

Review comment:
       Added




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572845064



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -31,7 +33,7 @@
 import org.apache.lucene.facet.taxonomy.ParallelTaxonomyArrays;
 import org.apache.lucene.facet.taxonomy.TaxonomyReader;
 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.CorruptIndexException; // javadocs
+import org.apache.lucene.index.CorruptIndexException;

Review comment:
       Hmm, did we remove the `// javadocs` comment on purpose?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);

Review comment:
       Hmm, I'm confused -- wouldn't an older index have no `BinaryDocValues` field?  So, `values` would be null, and we should fallback then?
   
   This code should hit `NullPointerException` on an old index I think?  How come our backwards compatibility test didn't expose this?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+      }
+    }
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      synchronized (categoryCache) {
+        categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);

Review comment:
       We will sometimes put ordinals back into the cache that were already there at the start of this method right?  I guess that's harmless.  Or, maybe we should move this up above?  Then we can do it only for those ordinals that were not already cached?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;

Review comment:
       Hmm, remove this lonely semi-colon?  Did `gradle precommit` pass?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {

Review comment:
       Yay, thanks!

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,32 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt()));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);

Review comment:
       OK, thanks for confirming.
   
   Hmm, that is spooky if `assert` is always enabled for Lucene tests!  We may have lost this functionality in the gradle migration?  Maybe, separately confirm this with a tiny test case, and if you cannot get that tiny test case to fail, send an email to `dev@` list asking about it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566980263



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/

Review comment:
       Done in the recent commit




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r574162026



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+      }
+    }
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      synchronized (categoryCache) {
+        categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);

Review comment:
       I think intuitively adding the ordinals back into the cache would not be a problem. This should also (theoretically) be faster than trying to get the lock again and again in a loop? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566978754



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -314,7 +316,6 @@ public int getOrdinal(FacetLabel cp) throws IOException {
 
   @Override
   public FacetLabel getPath(int ordinal) throws IOException {
-    ensureOpen();

Review comment:
       In the earlier iteration it was shifted to the helper method that looked up values in the cache. I have kept it in the outer method in the most recent commit




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572448749



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {

Review comment:
       It is being added to the map in the line that reads 
   ```
   if (ordinalPath != null) {
           bulkPath[i] = ordinalPath;
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#issuecomment-857311496


   > Once we iterate to a solid PR I am very curious how this helps facets performance -- we can switch luceneutil over to this bulk API to test.
   
   Today both `IntTaxonomyFacets` and `FloatTaxonomyFacets` iteratively call `getPath` on all the top ordinals and then return the `topChildren` `FacetLabels` that the user wanted. With this API change, we could switch over `IntTaxonomyFacets` and `FloatTaxonomyFacets` to use this bulk API. All downstream children `Facets` such as `FastTaxonomyFacetCount` use this base `getTopChildren` function so all users will be able to benefit from this change by default.
   
   Surprisingly,
   Our `luceneutil` benchmark **only** tests the `getTopChildren` API so we should be able to see the performance change with the stock `luceneutil` package. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566988968



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+          assert values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);

Review comment:
       Thanks for catching this. Fixed it in the new commit




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r578042299



##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,32 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt()));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);

Review comment:
       I got a small test case working that modified a global variable 1000 times (via an `assert`) and checked the final value. It seems that `assert` is always enabled. 
   
   I'll send an email about the same
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572444675



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -320,18 +322,12 @@ public FacetLabel getPath(int ordinal) throws IOException {
     // doOpenIfChanged, we need to ensure that the ordinal is one that this DTR
     // instance recognizes. Therefore we do this check up front, before we hit
     // the cache.
-    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
-      return null;
-    }
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    isOrdinalInIndexReaderRange(ordinal, indexReaderMaxDoc);

Review comment:
       Yep. In fact, we had also created a spinoff [issue](https://issues.apache.org/jira/browse/LUCENE-9460) for addressing this expected behavior change in JIRA last year.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566976157



##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,31 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testBulkPath() throws Exception {

Review comment:
       Done. This is definitely better




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] muse-dev[bot] commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
muse-dev[bot] commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572484392



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          // this check is only needed once to confirm that the index uses BinaryDocValues
+          boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+          if (success == false) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+      }
+    }
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      synchronized (categoryCache) {

Review comment:
       *THREAD_SAFETY_VIOLATION:*  Read/Write race. Non-private method `DirectoryTaxonomyReader.getBulkPath(...)` reads without synchronization from `this.categoryCache`. Potentially races with write in method `DirectoryTaxonomyReader.doClose()`.
    Reporting because this access may occur on a background thread.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -320,23 +323,16 @@ public FacetLabel getPath(int ordinal) throws IOException {
     // doOpenIfChanged, we need to ensure that the ordinal is one that this DTR
     // instance recognizes. Therefore we do this check up front, before we hit
     // the cache.
-    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
-      return null;
-    }
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    checkOrdinalBounds(ordinal, indexReaderMaxDoc);
 
-    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
-    // wrapped as LRU?
-    Integer catIDInteger = Integer.valueOf(ordinal);
-    synchronized (categoryCache) {
-      FacetLabel res = categoryCache.get(catIDInteger);
-      if (res != null) {
-        return res;
-      }
+    FacetLabel ordinalPath = getPathFromCache(ordinal);

Review comment:
       *THREAD_SAFETY_VIOLATION:*  Read/Write race. Non-private method `DirectoryTaxonomyReader.getPath(...)` indirectly reads without synchronization from `this.categoryCache`. Potentially races with write in method `DirectoryTaxonomyReader.doClose()`.
    Reporting because this access may occur on a background thread.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r564505813



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -314,7 +316,6 @@ public int getOrdinal(FacetLabel cp) throws IOException {
 
   @Override
   public FacetLabel getPath(int ordinal) throws IOException {
-    ensureOpen();

Review comment:
       Hmm why is this removed?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;

Review comment:
       Hmm can we instead throw exception w/ meaningful message?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);

Review comment:
       Instead of a `Map<Integer,Integer>` you could make two parallel arrays, and sort both arrays in a "matched" manner.  Lucene's sort APIs in `oal.util` make this simple, and there are example in Lucene's sources of "sorting parallel arrays".

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+          assert values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);

Review comment:
       Hmm, this is dangerous: we are requiring that this `.advanceExact` be called, since we below use `.binaryValue()`?  But if user disables assertions, this won't happen, and then `.binaryValue()` might be wrong or throw an exception?
   
   I think you should do it in two steps, e.g.:
   
   ```
     boolean success = values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);
     assert success;
   ```
   ?
   
   Also, why didn't the unit test uncover this failure?  Lucene's `test-framework` intentionally, randomly some percentage of the time, disables assertions precisely because of this situation :)  Long ago a Lucene release had a bug that accidentally "relied" on assertions being enabled in order to run correctly.
   
   Edit: OK, I see -- the unit test only creates a single segment index, so this line is not exercised I think (yet).

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,31 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testBulkPath() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String arr[] = new String[] {"a", "b", "c", "d", "e"};
+
+    FacetLabel allpaths[] = new FacetLabel[arr.length];
+    int allords[] = new int[arr.length];
+
+    for (int i = 0; i < arr.length; i++) {
+      allpaths[i] = new FacetLabel(arr[i]);
+      w.addCategory(allpaths[i]);
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allpaths.length; i++) {
+      allords[i] = r1.getOrdinal(allpaths[i]);
+    }
+
+    FacetLabel allBulkpath[] = r1.getBulkPath(allords);
+    assertArrayEquals(allpaths, allBulkpath);

Review comment:
       This is a good initial basic test case, thanks!
   
   Could you also add a randomized test, indexing random number of facet labels, hopefully exercising multiple segments randomly, and then doing bulk lookup of random collections of ordinals in random order?  It could help uncover exciting bugs :)

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {

Review comment:
       Hmm, why wouldn't this entry be `null`?  Shouldn't it always be `null`?  I.e. each requested ordinal goes to a unique position?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/

Review comment:
       Whoa, why is that!?  Oh, this is because it must be `BinaryDocValues` taxonomy index, OK.  If the user messes up and tries to use this API on stored fields, what happens?
   
   Also, can we enhance the javadoc to state what the method does, link to the non-bulk API, etc.?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {

Review comment:
       Hmm, instead of relying on `advanceExact` returning `false` to know when to advance to the next reader, could we instead use `.maxDoc` of the reader we are on?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();

Review comment:
       Maybe use `IntIntScatterMap` from HPPC instead, to create less garbage ... `facet` module already depends on HPPC.
   
   But this optimization can come later, and is likely negligible for most search use cases anyways.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {

Review comment:
       Maybe rename the parameter to `ordinals`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572443600



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,104 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void isOrdinalInIndexReaderRange(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} over an array of
+   * ordinals.
+   *
+   * <p>This API is only available for Lucene indexes created with 8.7+ codec because it uses
+   * BinaryDocValues instead of StoredFields. Use the {@link
+   * org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader#getPath} method for indices
+   * created with codec version older than 8.7.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   * @throws IOException if the taxonomy index is created using the older StoredFields based codec.
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    FacetLabel[] bulkPath = new FacetLabel[ordinals.length];
+    // remember the original positions of ordinals before they are sorted
+    IntIntScatterMap originalPosition = new IntIntScatterMap();
+    int indexReaderMaxDoc = indexReader.maxDoc();
+    for (int i = 0; i < ordinals.length; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      isOrdinalInIndexReaderRange(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinals[i], i);

Review comment:
       Ooof. Missed this bug. I have added a small `Sorter` implementation for the parallel arrays (is this reusable code somewhere else as well?)
   
   Modified the new test case so that it is forced to use duplicate ordinals with a probability of 0.5.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r578017904



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;

Review comment:
       Whoa, OK, I see!  Wild, it is supposed to be there, and on its own line, OK :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] shaie commented on a change in pull request #2247: WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
shaie commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r564721526



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();

Review comment:
       I think Mike asked about it above -- it's better that we call this method in the outer *public* API so it's visible and clear we perform this check rather than rely on calling it in a private method.
   
   Also, potentially a `private` method may be called many times in the course of one `public` API invocation, and we don't need to perform this check over and over.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;

Review comment:
       It's a style comment, can we do this:
   
   ```
   FacetLabel res = null;
   sync (catCache) {
     res = catCache.get(ordinal);
   }
   return res;
   ```
   
   It's just shorter and I believe achieves the same outcome.
   
   In fact, the entire code can be changed to:
   
   ```
   sync(catCache) {
     return catCache.get(ordinal);
   }
   ```
   ??

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);

Review comment:
       There you go - you call the new private method in the context of a single public API invocation, therefore no need to perform the `ensureOpen` check for each ordinal 😄 

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,31 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testBulkPath() throws Exception {

Review comment:
       I know Lucene tests are written with that naming convention, but over the last few years I've started to name tests that describe what's being tested. e.g. `testCallingBulkPathReturnsCorrectResult` vs `testBulkPathFailsIfReaderIsClosed` ...
   
   Would be nice if we gradually started to name our tests with more descriptive names :)

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;

Review comment:
       I don't want to do premature optimization here, but `indexReader.maxDoc()` is called for each ordinal, but it cannot change between ordinals, so we could extract its value outside the loop?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {

Review comment:
       Again, don't want to prematurely optimize, but we could extract `docBase` outside the loop and update it after we update `readerIndex`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {

Review comment:
       I'd also change the method to take `int... ordinal` instead? More convenient if you want to only get the path for one ordinal (even though the method is called `getBulkXXX`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {

Review comment:
       Separately, and it's been a while since I looked at this code so take treat this comment cautiously -- it confuses me that the method is called `getBulkPath` where we return `FacetLabel` ... perhaps we should rename it to `getFacetLabels`?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+          assert values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);

Review comment:
       @mikemccand it could also be that the test runs just didn't trip yet on the seed which disabled assertions for this code, right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572447577



##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,32 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt()));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);

Review comment:
       Done. 
   
   Thanks for the suggestion. I verified that this code trips when `values.advanceExact` is removed, **but** I was not able to verify this by just keeping the `assert values.advanceExact(ordinals[i] - leafReaderDocBase)` line. Perhaps assert was always being executed?
   
   I also manually verified (with print statements) that multi segments were created and were working as expected.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r574002264



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int originalPosition[] = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    // parallel sort the ordinals and originalPosition array based on the values in the ordinals
+    // array
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;

Review comment:
       `gradle check` (equivalent to precommit) passes. This semicolon is put automatically on a new line because of `./gradlew :lucene:facet:spotlessApply` (used for auto fixing style violations)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] gautamworah96 commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r566981233



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);

Review comment:
       Hmmm. I think using a map to remember initial positions is still simpler to understand? or I could still change it in the next revision




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org