Posted to commits@datasketches.apache.org by "AlexanderSaydakov (via GitHub)" <gi...@apache.org> on 2023/02/23 20:23:18 UTC

[GitHub] [datasketches-java] AlexanderSaydakov commented on a diff in pull request #428: Theta compression

AlexanderSaydakov commented on code in PR #428:
URL: https://github.com/apache/datasketches-java/pull/428#discussion_r1116210095


##########
src/main/java/org/apache/datasketches/theta/CompactSketch.java:
##########
@@ -285,4 +249,128 @@ public boolean isCompact() {
     return true;
   }
 
+  public byte[] toByteArrayCompressed() {
+    if (!isOrdered() || getRetainedEntries() == 0 || (getRetainedEntries() == 1 && !isEstimationMode())) {
+      return toByteArray();
+    }
+    return toByteArrayV4();
+  }
+
+  private int computeMinLeadingZeros() {
+    // compression is based on leading zeros in deltas between ordered hash values
+    // assumes ordered sketch
+    long previous = 0;
+    long ored = 0;
+    HashIterator it = iterator();
+    while (it.next()) {
+      final long delta = it.get() - previous;
+      ored |= delta;
+      previous = it.get();
+    }
+    return Long.numberOfLeadingZeros(ored);

Review Comment:
   Delta encoding makes the most sense for ordered values, so this proposed code does not compress unordered sketches. toByteArrayCompressed() delegates to the V3 toByteArray() for unordered, empty, and single-item (non-estimation-mode) sketches.
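
To illustrate why ordering matters for the quoted computeMinLeadingZeros(), here is a minimal standalone sketch. It assumes non-negative hash values and uses a plain long[] in place of the sketch's HashIterator; the class name DeltaLeadingZeros and the method minLeadingZeros are hypothetical and not part of datasketches-java.

```java
public final class DeltaLeadingZeros {

  // Hypothetical helper, not part of datasketches-java: mirrors the quoted
  // computeMinLeadingZeros() but reads from a plain array instead of a HashIterator.
  static int minLeadingZeros(final long[] values) {
    long previous = 0;
    long ored = 0;
    for (final long value : values) {
      ored |= value - previous; // OR keeps every bit position used by any delta
      previous = value;
    }
    return Long.numberOfLeadingZeros(ored); // 64 when there are no values
  }

  public static void main(final String[] args) {
    final long[] ordered = {5L, 12L, 40L};   // deltas: 5, 7, 28
    final long[] unordered = {40L, 5L, 12L}; // deltas: 40, -35, 7
    System.out.println(minLeadingZeros(ordered));   // 59, so 5 bits per delta suffice
    System.out.println(minLeadingZeros(unordered)); // 0, the negative delta sets the sign bit
  }
}
```

With ordered input every delta is non-negative, so the shared leading zeros can be dropped from each delta. A single out-of-order value produces a negative delta, which leaves no leading zeros to remove.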



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

