You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/26 13:41:47 UTC

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index

mikemccand commented on a change in pull request #2247:
URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r564505813



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -314,7 +316,6 @@ public int getOrdinal(FacetLabel cp) throws IOException {
 
   @Override
   public FacetLabel getPath(int ordinal) throws IOException {
-    ensureOpen();

Review comment:
       Hmm why is this removed?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;

Review comment:
       Hmm can we instead throw exception w/ meaningful message?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);

Review comment:
       Instead of a `Map<Integer,Integer>` you could make two parallel arrays, and sort both arrays in a "matched" manner.  Lucene's sort APIs in `oal.util` make this simple, and there are example in Lucene's sources of "sorting parallel arrays".

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {
+          readerIndex = ReaderUtil.subIndex(ord, indexReader.leaves());
+          LeafReader leafReader = indexReader.leaves().get(readerIndex).reader();
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+          assert values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);

Review comment:
       Hmm, this is dangerous: we are requiring that this `.advanceExact` be called, since we below use `.binaryValue()`?  But if user disables assertions, this won't happen, and then `.binaryValue()` might be wrong or throw an exception?
   
   I think you should do it in two steps, e.g.:
   
   ```
     boolean success = values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase);
     assert success;
   ```
   ?
   
   Also, why didn't the unit test uncover this failure?  Lucene's `test-framework` intentionally, randomly some percentage of the time, disables assertions precisely because of this situation :)  Long ago a Lucene release had a bug that accidentally "relied" on assertions being enabled in order to run correctly.
   
   Edit: OK, I see -- the unit test only creates a single segment index, so this line is not exercised I think (yet).

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -569,4 +569,31 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testBulkPath() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String arr[] = new String[] {"a", "b", "c", "d", "e"};
+
+    FacetLabel allpaths[] = new FacetLabel[arr.length];
+    int allords[] = new int[arr.length];
+
+    for (int i = 0; i < arr.length; i++) {
+      allpaths[i] = new FacetLabel(arr[i]);
+      w.addCategory(allpaths[i]);
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allpaths.length; i++) {
+      allords[i] = r1.getOrdinal(allpaths[i]);
+    }
+
+    FacetLabel allBulkpath[] = r1.getBulkPath(allords);
+    assertArrayEquals(allpaths, allBulkpath);

Review comment:
       This is a good initial basic test case, thanks!
   
   Could you also add a randomized test, indexing random number of facet labels, hopefully exercising multiple segments randomly, and then doing bulk lookup of random collections of ordinals in random order?  It could help uncover exciting bugs :)

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {

Review comment:
       Hmm, why wouldn't this entry be `null`?  Shouldn't it always be `null`?  I.e. each requested ordinal goes to a unique position?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/

Review comment:
       Whoa, why is that!?  Oh, this is because it must be `BinaryDocValues` taxonomy index, OK.  If the user messes up and tries to use this API on stored fields, what happens?
   
   Also, can we enhance the javadoc to state what the method does, link to the non-bulk API, etc.?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();
+    for (int i = 0; i < ordinal.length; i++) {
+      if (ordinal[i] < 0 || ordinal[i] >= indexReader.maxDoc()) {
+        return null;
+      }
+      FacetLabel ordinalPath = getPathFromCache(ordinal[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+      originalPosition.put(ordinal[i], i);
+    }
+
+    Arrays.sort(ordinal);
+    int readerIndex = 0;
+    BinaryDocValues values = null;
+
+    for (int ord : ordinal) {
+      if (bulkPath[originalPosition.get(ord)] == null) {
+        if (values == null
+            || values.advanceExact(ord - indexReader.leaves().get(readerIndex).docBase) == false) {

Review comment:
       Hmm, instead of relying on `advanceExact` returning `false` to know when to advance to the next reader, could we instead use `.maxDoc` of the reader we are on?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {
+    FacetLabel[] bulkPath = new FacetLabel[ordinal.length];
+    Map<Integer, Integer> originalPosition = new HashMap<>();

Review comment:
       Maybe use `IntIntScatterMap` from HPPC instead, to create less garbage ... `facet` module already depends on HPPC.
   
   But this optimization can come later, and is likely negligible for most search use cases anyways.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -353,12 +349,65 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    ensureOpen();
+
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      FacetLabel res = categoryCache.get(ordinal);
+      if (res != null) {
+        return res;
+      }
+    }
+    return null;
+  }
+
+  /* This API is only supported for indexes created with Lucene 8.7+ codec **/
+  public FacetLabel[] getBulkPath(int[] ordinal) throws IOException {

Review comment:
       Maybe rename the parameter to `ordinals`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org