You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by "davecromberge (via GitHub)" <gi...@apache.org> on 2023/02/23 16:13:14 UTC

[GitHub] [datasketches-java] davecromberge commented on a diff in pull request #428: Theta compression

davecromberge commented on code in PR #428:
URL: https://github.com/apache/datasketches-java/pull/428#discussion_r1115899337


##########
src/main/java/org/apache/datasketches/theta/CompactSketch.java:
##########
@@ -285,4 +249,128 @@ public boolean isCompact() {
     return true;
   }
 
+  public byte[] toByteArrayCompressed() {
+    if (!isOrdered() || getRetainedEntries() == 0 || (getRetainedEntries() == 1 && !isEstimationMode())) {
+      return toByteArray();
+    }
+    return toByteArrayV4();
+  }
+
+  private int computeMinLeadingZeros() {
+    // compression is based on leading zeros in deltas between ordered hash values
+    // assumes ordered sketch
+    long previous = 0;
+    long ored = 0;
+    HashIterator it = iterator();
+    while (it.next()) {
+      final long delta = it.get() - previous;
+      ored |= delta;
+      previous = it.get();
+    }
+    return Long.numberOfLeadingZeros(ored);

Review Comment:
   To verify understanding, are the deltas of the hashes being bit computed to more efficiently bit-pack?  Is the assumption above about ordering ever invalid?



##########
src/main/java/org/apache/datasketches/theta/CompactSketch.java:
##########
@@ -107,27 +91,34 @@ public static CompactSketch heapify(final Memory srcMem) {
    * @return a CompactSketch on the heap.
    */
   public static CompactSketch heapify(final Memory srcMem, final long expectedSeed) {
-    final int serVer = srcMem.getByte(SER_VER_BYTE);
-    final byte familyID = srcMem.getByte(FAMILY_BYTE);
+    return heapify(srcMem, expectedSeed, true);
+  }
 
+  private static CompactSketch heapify(final Memory srcMem, final long seed, final boolean enforceSeed) {
+    final int serVer = extractSerVer(srcMem);
+    final int familyID = extractFamilyID(srcMem);
     final Family family = idToFamily(familyID);
     if (family != Family.COMPACT) {
       throw new IllegalArgumentException("Corrupted: " + family + " is not Compact!");
     }
+    if (serVer == 4) {
+       return heapifyV4(srcMem, seed, enforceSeed);
+    }
     if (serVer == 3) {
-      final int flags = PreambleUtil.extractFlags(srcMem);
+      final int flags = extractFlags(srcMem);
       final boolean srcOrdered = (flags & ORDERED_FLAG_MASK) != 0;
       final boolean empty = (flags & EMPTY_FLAG_MASK) != 0;
-      if (!empty) { PreambleUtil.checkMemorySeedHash(srcMem, expectedSeed); }
+      if (enforceSeed && !empty) { PreambleUtil.checkMemorySeedHash(srcMem, seed); }
       return CompactOperations.memoryToCompact(srcMem, srcOrdered, null);
     }
     //not SerVer 3, assume compact stored form
-    final short seedHash = ThetaUtil.computeSeedHash(expectedSeed);
+    final short seedHash = ThetaUtil.computeSeedHash(seed);
     if (serVer == 1) {
       return ForwardCompatibility.heapify1to3(srcMem, seedHash);
     }
     if (serVer == 2) {
-      return ForwardCompatibility.heapify2to3(srcMem, seedHash);
+      return ForwardCompatibility.heapify2to3(srcMem,
+          enforceSeed ? seedHash : (short) extractSeedHash(srcMem));

Review Comment:
   To clarify the behaviour here for serVer 3 - it looks as if these are supported but once written will be compacted more efficiently using serVer 4 when the keys are ordered.  Is this correct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org