You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/10 15:24:28 UTC

[GitHub] [lucene] jpountz commented on a diff in pull request #875: LUCENE-10560: Speed up OrdinalMap construction a bit.

jpountz commented on code in PR #875:
URL: https://github.com/apache/lucene/pull/875#discussion_r869377332


##########
lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java:
##########
@@ -48,10 +49,69 @@ public class OrdinalMap implements Accountable {
   // need it
   // TODO: use more efficient packed ints structures?
 
+  /**
+   * Copy the first 8 bytes of the given term as a comparable unsigned long. In case the term has
+   * less than 8 bytes, missing bytes will be replaced with zeroes. Note that two terms that produce
+   * the same long could still be different due to the fact that missing bytes are replaced with
+   * zeroes, e.g. {@code [1, 0]} and {@code [1]} get mapped to the same long.
+   */
+  static long prefix8ToComparableUnsignedLong(BytesRef term) {
+    // Use Big Endian so that longs are comparable
+    if (term.length >= Long.BYTES) {
+      return (long) BitUtil.VH_BE_LONG.get(term.bytes, term.offset);
+    } else {
+      long l;
+      int offset;
+      if (term.length >= Integer.BYTES) {
+        l = (int) BitUtil.VH_BE_INT.get(term.bytes, term.offset);
+        offset = Integer.BYTES;
+      } else {
+        l = 0;
+        offset = 0;
+      }
+      while (offset < term.length) {
+        l = (l << 8) | Byte.toUnsignedLong(term.bytes[term.offset + offset]);
+        offset++;
+      }
+      l <<= (Long.BYTES - term.length) << 3;
+      return l;
+    }
+  }
+
+  private static int compare(BytesRef termA, long prefix8A, BytesRef termB, long prefix8B) {
+    assert prefix8A == prefix8ToComparableUnsignedLong(termA);
+    assert prefix8B == prefix8ToComparableUnsignedLong(termB);
+    if (prefix8A != prefix8B) {
+      // Terms differ in their first 8 bytes, compare prefixes
+      int cmp = Long.compareUnsigned(prefix8A, prefix8B);
+      assert Integer.signum(cmp)
+              == Integer.signum(
+                  Arrays.compareUnsigned(
+                      termA.bytes,
+                      termA.offset,
+                      termA.offset + termA.length,
+                      termB.bytes,
+                      termB.offset,
+                      termB.offset + termB.length))
+          : termA + " " + termB + " " + cmp;
+      return cmp;
+    } else {
+      // Compare terms
+      return Arrays.compareUnsigned(

Review Comment:
   > In fact, if the term length is smaller than 8 no comparison is needed at all (they're equal).
   > if prefixes are the same then this can skip the leading characters?
   
   I tried both these changes and they did not help significantly. FWIW if the long is the same, we still need to check whether terms have the same length, otherwise we could hit corner cases such as the fact that `{1, 0}` and `{1}` get mapped to the same long. This introduced some complexity and didn't seem to help much in my benchmarks, so I tried to keep things as simple as possible.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org