You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by "leerho (via GitHub)" <gi...@apache.org> on 2024/02/14 02:39:44 UTC

[PR] Improve partition boundaries (datasketches-java)

leerho opened a new pull request, #505:
URL: https://github.com/apache/datasketches-java/pull/505

   This fixes a bug discovered in recent Druid testing.
   
   Makes other improvements to the getPartionBoundaries code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Improve partition boundaries (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on code in PR #505:
URL: https://github.com/apache/datasketches-java/pull/505#discussion_r1490214130


##########
src/test/java/org/apache/datasketches/sampling/EbppsSketchTest.java:
##########
@@ -22,8 +22,10 @@
 import static org.testng.Assert.assertEquals;
 import static org.testng.Assert.assertFalse;
 import static org.testng.Assert.assertTrue;
+import static org.testng.Assert.fail;
 
 import java.util.ArrayList;
+import java.util.Arrays;

Review Comment:
   Yes, and so is: import static org.testng.Assert.fail;
   I will fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Improve partition boundaries (datasketches-java)

Posted by "jmalkin (via GitHub)" <gi...@apache.org>.
jmalkin commented on code in PR #505:
URL: https://github.com/apache/datasketches-java/pull/505#discussion_r1490204855


##########
src/test/java/org/apache/datasketches/sampling/EbppsSketchTest.java:
##########
@@ -22,8 +22,10 @@
 import static org.testng.Assert.assertEquals;
 import static org.testng.Assert.assertFalse;
 import static org.testng.Assert.assertTrue;
+import static org.testng.Assert.fail;
 
 import java.util.ArrayList;
+import java.util.Arrays;

Review Comment:
   This is probably extraneous?



##########
src/main/java/org/apache/datasketches/quantiles/ItemsSketchSortedView.java:
##########
@@ -155,27 +159,68 @@ public GenericPartitionBoundaries<T> getPartitionBoundaries(final int numEqually
       final QuantileSearchCriteria searchCrit) {
     if (isEmpty()) { throw new IllegalArgumentException(QuantilesAPI.EMPTY_MSG); }
     final long totalN = this.totalN;
+    final double delta = getNormalizedRankError(k, true) * totalN;
+    final int maxParts = (int) (totalN / Math.ceil(delta * 2) );
     final int svLen = cumWeights.length;
-    //adjust ends of sortedView arrays
-    cumWeights[0] = 1L;
-    cumWeights[svLen - 1] = totalN;
-    quantiles[0] = this.getMinItem();
-    quantiles[svLen - 1] = this.getMaxItem();
-
-    final double[] evSpNormRanks = evenlySpacedDoubles(0, 1.0, numEquallySized + 1);
-    final int len = evSpNormRanks.length;
-    final T[] evSpQuantiles = (T[]) Array.newInstance(clazz, len);
-    final long[] evSpNatRanks = new long[len];
-    for (int i = 0; i < len; i++) {
-      final int index = getQuantileIndex(evSpNormRanks[i], searchCrit);
-      evSpQuantiles[i] = quantiles[index];
-      evSpNatRanks[i] = cumWeights[index];
+    if (numEquallySized > maxParts) {
+      throw new SketchesArgumentException(QuantilesAPI.UNSUPPORTED_MSG
+          + "Requested number of partitions is too large for this sized sketch. "
+          + "It exceeds maximum number of partitions allowed by the error threshold for this size sketch."
+          + "Requested Partitions: " + numEquallySized + " > " + maxParts);
+    }
+    if (numEquallySized > svLen / 2) {
+      throw new SketchesArgumentException(QuantilesAPI.UNSUPPORTED_MSG
+          + "Requested number of partitions is too large for this sized sketch. "
+          + "It exceeds maximum number of retained items divided by 2."
+          + "Requested Partitions: " + numEquallySized + " > "
+          + "Retained Items / 2: " + (svLen / 2));
+    }
+
+    final double[] searchNormRanks = evenlySpacedDoubles(0, 1.0, numEquallySized + 1);
+    final int partArrLen = searchNormRanks.length;
+    final T[] partQuantiles = (T[]) Array.newInstance(clazz, partArrLen);
+    final long[] partNatRanks = new long[partArrLen];
+    final double[] partNormRanks = new double[partArrLen];
+    //adjust ends
+    int adjLen = svLen;
+    final boolean adjLow = quantiles[0] != minItem;
+    final boolean adjHigh = quantiles[svLen - 1] != maxItem;
+    adjLen += adjLow ? 1 : 0;
+    adjLen += adjHigh ? 1 : 0;
+
+    final T[] adjQuantiles;

Review Comment:
   This is the start of a long code block with no indication of what it's doing. It'd be useful to have a little description of what's going on here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Improve partition boundaries (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on code in PR #505:
URL: https://github.com/apache/datasketches-java/pull/505#discussion_r1490214423


##########
src/main/java/org/apache/datasketches/quantiles/ItemsSketchSortedView.java:
##########
@@ -155,27 +159,68 @@ public GenericPartitionBoundaries<T> getPartitionBoundaries(final int numEqually
       final QuantileSearchCriteria searchCrit) {
     if (isEmpty()) { throw new IllegalArgumentException(QuantilesAPI.EMPTY_MSG); }
     final long totalN = this.totalN;
+    final double delta = getNormalizedRankError(k, true) * totalN;
+    final int maxParts = (int) (totalN / Math.ceil(delta * 2) );
     final int svLen = cumWeights.length;
-    //adjust ends of sortedView arrays
-    cumWeights[0] = 1L;
-    cumWeights[svLen - 1] = totalN;
-    quantiles[0] = this.getMinItem();
-    quantiles[svLen - 1] = this.getMaxItem();
-
-    final double[] evSpNormRanks = evenlySpacedDoubles(0, 1.0, numEquallySized + 1);
-    final int len = evSpNormRanks.length;
-    final T[] evSpQuantiles = (T[]) Array.newInstance(clazz, len);
-    final long[] evSpNatRanks = new long[len];
-    for (int i = 0; i < len; i++) {
-      final int index = getQuantileIndex(evSpNormRanks[i], searchCrit);
-      evSpQuantiles[i] = quantiles[index];
-      evSpNatRanks[i] = cumWeights[index];
+    if (numEquallySized > maxParts) {
+      throw new SketchesArgumentException(QuantilesAPI.UNSUPPORTED_MSG
+          + "Requested number of partitions is too large for this sized sketch. "
+          + "It exceeds maximum number of partitions allowed by the error threshold for this size sketch."
+          + "Requested Partitions: " + numEquallySized + " > " + maxParts);
+    }
+    if (numEquallySized > svLen / 2) {
+      throw new SketchesArgumentException(QuantilesAPI.UNSUPPORTED_MSG
+          + "Requested number of partitions is too large for this sized sketch. "
+          + "It exceeds maximum number of retained items divided by 2."
+          + "Requested Partitions: " + numEquallySized + " > "
+          + "Retained Items / 2: " + (svLen / 2));
+    }
+
+    final double[] searchNormRanks = evenlySpacedDoubles(0, 1.0, numEquallySized + 1);
+    final int partArrLen = searchNormRanks.length;
+    final T[] partQuantiles = (T[]) Array.newInstance(clazz, partArrLen);
+    final long[] partNatRanks = new long[partArrLen];
+    final double[] partNormRanks = new double[partArrLen];
+    //adjust ends
+    int adjLen = svLen;
+    final boolean adjLow = quantiles[0] != minItem;
+    final boolean adjHigh = quantiles[svLen - 1] != maxItem;
+    adjLen += adjLow ? 1 : 0;
+    adjLen += adjHigh ? 1 : 0;
+
+    final T[] adjQuantiles;

Review Comment:
   Good feedback.  Will do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Improve partition boundaries (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho merged PR #505:
URL: https://github.com/apache/datasketches-java/pull/505


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org