You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "davecromberge (via GitHub)" <gi...@apache.org> on 2023/11/22 14:26:27 UTC

[PR] Theta Sketch Aggregation Enhancements [pinot]

davecromberge opened a new pull request, #12042:
URL: https://github.com/apache/pinot/pull/12042

   Introduces additional parameters to the DistinctCountThetaSketch aggregation function that give the end-user more control over how sketches are merged.  The defaults are selected to ensure that the behaviour remains unchanged over the current implementation.
   
   Furthermore, an accumulator custom object is added to ensure that pairwise union operations are avoided as much as possible.  Instead, sketches can be aggregated and merged when a threshold is met.
   
   This PR is a `performance` enhancement and can be tagged/labelled as such.
   
   `release-notes`:
   - add configuration options for DistinctCountThetaSketchAggregationFunction
   - respect ordering for existing Theta sketches to use "early-stop" optimisations for unions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414237048


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();

Review Comment:
   I think those points are valid - but I thought that it would be best to leave the decision as to whether to order in the hands of the end user.  The current version respects the input ordering.  
   
   Moreover, the sketch is not re-compacted if it is already in compacted form.
   
   This is a rather subjective take on the problem, and I will go with your recommendation on this one @snleee - what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408415739


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();
+      if (shouldCompact) {

Review Comment:
   No - this is not the case and it depends on how the ingested sketches have been processed by the end user.  It might be the case that this is true for the empty and singleton sketch (sketch of a single hashed item), but not for others.
   The emphasis on these changes is that the serialiser does not alter the current ordering on existing sketches.  There is a small performance optimisation to this in that there is no need to re-compact or re-order sketches that are already in that state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408420008


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {

Review Comment:
   That's right - the result is extracted when the server returns a result to the broker or when the broker performs the final post aggregation operations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408417423


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {

Review Comment:
   Yes, correct.  The early stop mechanism and internal book-keeping structure (a QuickSelect updateable sketch on heap) need only be created for a single merge.  The penalty that the end user pays for this amonisation is that the merge can consume more memory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1830765677

   > Thanks for the detailed description and the reasoning behind this approach. Intuitively this approach make sense. I took a first pass of the PR and have some high level questions
   > 
   > 1. There are 4 new params introduced. Have we quantified the gains for each of these params and which one yields the largest gains? Im assuming these params work independent of each other.
   > 2. We have nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented tuning this param to figure out the gains ?  How can this parameter impact performance.
   > 3. Could you share the benchmark results/numbers for different values of the params
   
   Thank you for your review @swaminathanmanish.  I found your questions very insightful.  I should make it clear at the outset that it has been difficult to thoroughly benchmark the work in this pull request in a test environment.  Some of the performance improvements are knobs to turn and are speculative in how they could behave.
   
   Therefore I would like to propose that the parameters are retained and tested in a production environment, and then subsequently removed should they show no user benefit.  The downside to this approach is that backward compatibility will not be maintained should end users start to depend on them and use them.   As for your questions, answers follow inline.
   
   > 1. There are 4 new params introduced. Have we quantified the gains for each of these params and which one yields the largest gains? Im assuming these params work independent of each other.
   
   The largest gain can be realised through adjusting the sampling probability parameter.  However, this is highly dependent on the use case and should only be used in certain circumstances.  All parameters are independent of each other and have been selected to have default behaviour that retains the existing behaviour of the system.  The performance gains measured in testing are between 25% and 50% performance improvement.  Where the results are sampled, the speed increases by 300%.
   
   > 2. We have nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented tuning this param to figure out the gains ?  How can this parameter impact performance.
   
   We have been using this extensively, but it does not help where there is a large tail of sketches that have retained items less than nominal entries.  For these cases, selecting lower nominal entries can impact the error across the board.  Instead, using sampling probability allows the end user to curtail size and increase error on the long tail, which, for some reports and queries, is a good tradeoff.
   
   > 3. Could you share the benchmark results/numbers for different values of the params
   
   I'd like to do more testing in a production environment.  But, the default set of parameters can be up to 25% faster than the current implementation (through sorting and retaining unions).  However, adjusting others such as sampling has shown orders of magnitude gains - but, there are certain tradeoffs on accuracy that need to be considered.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1825531902

   I'm converting this to a draft in order to gather feedback on the approach.  I'm busy exploring point 2 in the alternatives section above which might lead to a simpler implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "jackjlli (via GitHub)" <gi...@apache.org>.
jackjlli commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1899555330

   Hey @davecromberge @snleee @swaminathanmanish , thanks for making the contribution to theta sketch functionality! We really appreciate that! While this PR introduces the backward compatibility that the data type is switched from `Sketch` to `ThetaSketchAccumulator`, which caused the existing tables running with `distinctCountThetaSketch` or `distinctCountRawThetaSketch` functionalities fail during the deployment time because of two different OSS versions being present in brokers and servers at the same time. This PR should have been marked as `backward incompatible`. And this is blocking our deployment in LinkedIn. Could you please adjust the code to make sure this enhancement won't break the existing running tables?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1822891037

   Notes for reviewers:
   ===============
   
   The changes in this PR introduce a different intermediate result type for the DistinctCountThetaSketch aggregation function.  The reason for doing so is to eliminate the intermediate bookkeeping necessary to perform unions on two sketches at a time.   
    Moreover, the "early-stop" optimisation in set operation circumvents further processing when retained items fall above the minimum theta value.  This applies to other post set operation expressions as well.
   
   When using JMH benchmarks to simulate this scenario, the speedup achieved by accumulating sketches prior to union is often an improvement by a factor of 3.  
   However, I have not been able to observe this speedup in actual test queries due to limitations in my test environment.  Instead, this will become more important when sketches require merging from thousands of segments on a single server.
   
   Alternatives considered
   ------------------------
   
   1.  Only introducing additional parameters - this avoids the complexity in accumulating intermediate sketches and still gives the end user the option to use p-sampling.
   
   2. Using Union as the intermediate type.  The Datasketches library exposes the ability to serialize a union as bytes but does not expose the capability to use `fastWrap` which deserialises the union back into an object on the heap.  This would be simpler than the approach in this PR and if desired I can work with the Datasketches team to add this capability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1822977441

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: `167 lines` in your changes are missing coverage. Please review.
   > Comparison is base [(`e4d3a8d`)](https://app.codecov.io/gh/apache/pinot/commit/e4d3a8dde8df8be8fcc3ae20c96ec6d004dd7933?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.64% compared to head [(`4fd6650`)](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 0.00%.
   
   > :exclamation: Current head 4fd6650 differs from pull request most recent head 6e45abc. Consider uploading reports for the commit 6e45abc to get more accurate results
   
   | [Files](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [...n/DistinctCountThetaSketchAggregationFunction.java](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9EaXN0aW5jdENvdW50VGhldGFTa2V0Y2hBZ2dyZWdhdGlvbkZ1bmN0aW9uLmphdmE=) | 0.00% | [86 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...ent/local/customobject/ThetaSketchAccumulator.java](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9jdXN0b21vYmplY3QvVGhldGFTa2V0Y2hBY2N1bXVsYXRvci5qYXZh) | 0.00% | [46 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...org/apache/pinot/core/common/ObjectSerDeUtils.java](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9jb21tb24vT2JqZWN0U2VyRGVVdGlscy5qYXZh) | 0.00% | [20 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...istinctCountRawThetaSketchAggregationFunction.java](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9EaXN0aW5jdENvdW50UmF3VGhldGFTa2V0Y2hBZ2dyZWdhdGlvbkZ1bmN0aW9uLmphdmE=) | 0.00% | [10 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...he/pinot/segment/local/utils/CustomSerDeUtils.java](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9DdXN0b21TZXJEZVV0aWxzLmphdmE=) | 0.00% | [5 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   <details><summary>Additional details and impacted files</summary>
   
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #12042       +/-   ##
   =============================================
   - Coverage     61.64%    0.00%   -61.65%     
   =============================================
     Files          2385     2310       -75     
     Lines        129519   125890     -3629     
     Branches      20043    19502      -541     
   =============================================
   - Hits          79844        0    -79844     
   - Misses        43869   125890    +82021     
   + Partials       5806        0     -5806     
   ```
   
   | [Flag](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-0.01%)` | :arrow_down: |
   | [integration1](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration2](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | |
   | [java-11](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.59%)` | :arrow_down: |
   | [java-21](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.52%)` | :arrow_down: |
   | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.61%)` | :arrow_down: |
   | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.50%)` | :arrow_down: |
   | [temurin](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.65%)` | :arrow_down: |
   | [unittests](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [unittests1](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/12042/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   
   </details>
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/pinot/pull/12042?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).   
   :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408415739


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();
+      if (shouldCompact) {

Review Comment:
   No - this is not the case and it depends on how the ingested sketches have been processed by the end user.  It might be the case that this is true for the empty and singleton sketch (sketch of a single hashed item), but not for others.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408414659


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {
+      unionAll();
+    }
+  }
+
+  private Sketch unionAll() {
+    if (_union == null) {
+      _union = _setOperationBuilder.buildUnion();
+    }
+    // Return the default update "gadget" sketch as a compact sketch
+    if (isEmpty()) {
+      return _union.getResult(_ordered, null);
+    }
+    // Corner-case: the parameters are not strictly respected when there is a single sketch.
+    // This single sketch might have been the result of a previously accumulated union and
+    // would already have the parameters set.  The sketch is returned as-is without adjusting
+    // ordering and nominal entries which requires an additional union operation.
+    if (_numInputs == 1) {
+      return _accumulator.get(0);
+    }
+
+    // Performance optimisation: ensure that the minimum Theta is used for "early-stop".
+    _accumulator.sort(Comparator.comparingDouble(Sketch::getTheta));

Review Comment:
   Internally, an compact and ordered Theta sketch can be compared to an array of K items, where K is often in the order of thousands.
   
   To merge several sketches, the union operator will retain the items from each array need to be de-duplicated and stored where the item is less than the threshold, Theta.  If the Theta is chosen to be the minimum of those arrays, it means that it only has to process up to Min Theta for each sketch in the merge, thus reducing de-duplication costs and potentially thousands of hash collision checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408420454


##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountRawThetaSketchAggregationFunction.java:
##########
@@ -47,11 +49,18 @@ public ColumnDataType getFinalResultColumnType() {
   }
 
   @Override
-  public String extractFinalResult(List<Sketch> sketches) {
-    Sketch sketch = evaluatePostAggregationExpression(sketches);
+  public String extractFinalResult(List<ThetaSketchAccumulator> accumulators) {
+    int numAccumulators = accumulators.size();
+    List<Sketch> mergedSketches = new ArrayList<>(numAccumulators);
 
-    // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-    //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-    return Base64.getEncoder().encodeToString(sketch.compact(false, null).toByteArray());
+    for (ThetaSketchAccumulator accumulator : accumulators) {
+      accumulator.setOrdered(_intermediateOrdering);
+      accumulator.setThreshold(_accumulatorThreshold);
+      accumulator.setSetOperationBuilder(_setOperationBuilder);
+      mergedSketches.add(accumulator.getResult());

Review Comment:
   Yes, this is the default behaviour - I believe that this matches the existing implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408423857


##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunction.java:
##########
@@ -102,9 +111,22 @@ public DistinctCountThetaSketchAggregationFunction(List<ExpressionContext> argum
       Preconditions.checkArgument(paramsExpression.getType() == ExpressionContext.Type.LITERAL,
           "Second argument of DISTINCT_COUNT_THETA_SKETCH aggregation function must be literal (parameters)");
       Parameters parameters = new Parameters(paramsExpression.getLiteral().getStringValue());
+      // Allows the user to trade-off memory usage for merge CPU; higher values use more memory
+      _accumulatorThreshold = parameters.getAccumulatorThreshold();
+      // Ordering controls whether intermediate compact sketches are ordered in set operations
+      _intermediateOrdering = parameters.getIntermediateOrdering();
+      // Nominal entries controls sketch accuracy and size
       int nominalEntries = parameters.getNominalEntries();
       _updateSketchBuilder.setNominalEntries(nominalEntries);
       _setOperationBuilder.setNominalEntries(nominalEntries);
+      // Sampling probability sets the initial value of Theta, defaults to 1.0

Review Comment:
   This sampling probability is useful to end users who wish to build sketches on the fly from raw values or from existing sketches.   These users might wish to trim down the size of degenerate sketches (sketches that are below capacity).  There are certain use cases (such as ordering by top ranked items) where the user might not care if the tail of the list is negatively impacted in terms of error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1839141345

   @snleee thank you for your review - I'll wait for your opinion on [this comment](https://github.com/apache/pinot/pull/12042#discussion_r1414168534) before making any changes, if any.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414233789


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java:
##########
@@ -98,9 +98,9 @@ public static class Helix {
     public static final int DEFAULT_HYPERLOGLOG_PLUS_P = 14;
     public static final int DEFAULT_HYPERLOGLOG_PLUS_SP = 0;
 
-    // 2 to the power of 16, for tradeoffs see datasketches library documentation:
+    // 2 to the power of 14, for tradeoffs see datasketches library documentation:
     // https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html
-    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 65536;
+    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 16384;

Review Comment:
   No, unfortunately this is currently not configurable to my knowledge.  This smaller value more closely resembles the default that we select when querying the theta sketch which is 4096.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414251890


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {
+      unionAll();
+    }
+  }
+
+  private Sketch unionAll() {
+    if (_union == null) {
+      _union = _setOperationBuilder.buildUnion();
+    }
+    // Return the default update "gadget" sketch as a compact sketch
+    if (isEmpty()) {
+      return _union.getResult(_ordered, null);
+    }
+    // Corner-case: the parameters are not strictly respected when there is a single sketch.
+    // This single sketch might have been the result of a previously accumulated union and
+    // would already have the parameters set.  The sketch is returned as-is without adjusting
+    // ordering and nominal entries which requires an additional union operation.
+    if (_numInputs == 1) {
+      return _accumulator.get(0);
+    }
+
+    // Performance optimisation: ensure that the minimum Theta is used for "early-stop".
+    _accumulator.sort(Comparator.comparingDouble(Sketch::getTheta));

Review Comment:
   Done - see https://github.com/apache/pinot/pull/12042/commits/90d4562f6a4e7e84cb77fc3f5afb6ffc4e592e7f.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "swaminathanmanish (via GitHub)" <gi...@apache.org>.
swaminathanmanish commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408231135


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();
+      if (shouldCompact) {

Review Comment:
   I assume shouldCompact defaults to true?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {
+      unionAll();
+    }
+  }
+
+  private Sketch unionAll() {
+    if (_union == null) {
+      _union = _setOperationBuilder.buildUnion();
+    }
+    // Return the default update "gadget" sketch as a compact sketch
+    if (isEmpty()) {
+      return _union.getResult(_ordered, null);
+    }
+    // Corner-case: the parameters are not strictly respected when there is a single sketch.
+    // This single sketch might have been the result of a previously accumulated union and
+    // would already have the parameters set.  The sketch is returned as-is without adjusting
+    // ordering and nominal entries which requires an additional union operation.
+    if (_numInputs == 1) {
+      return _accumulator.get(0);
+    }
+
+    // Performance optimisation: ensure that the minimum Theta is used for "early-stop".
+    _accumulator.sort(Comparator.comparingDouble(Sketch::getTheta));

Review Comment:
   Optimization here is to stop the sort based on theta threshold ?  Just trying to understand how early stop takes effect here.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {

Review Comment:
   The main optimization here is to batch the sketch unions instead of doing many unions, to amortize the sort-merge cost, right? And thats where we tune this threshold parameter.



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -261,6 +261,11 @@ public static Sketch thetaSketchDiff(Object sketchObjectA, Object sketchObjectB)
     return diff.getResult(false, null, false);
   }
 
+  @ScalarFunction(names = {"thetaSketchToString", "theta_sketch_to_string"})

Review Comment:
   Are these new functions added to take in the new parameters?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {

Review Comment:
   is there a final catch-all to do the union (last batch), even if the threshold is not met? I guess it happens via 
   
     public Sketch getResult() {
       return unionAll();
     }



##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunction.java:
##########
@@ -102,9 +111,22 @@ public DistinctCountThetaSketchAggregationFunction(List<ExpressionContext> argum
       Preconditions.checkArgument(paramsExpression.getType() == ExpressionContext.Type.LITERAL,
           "Second argument of DISTINCT_COUNT_THETA_SKETCH aggregation function must be literal (parameters)");
       Parameters parameters = new Parameters(paramsExpression.getLiteral().getStringValue());
+      // Allows the user to trade-off memory usage for merge CPU; higher values use more memory
+      _accumulatorThreshold = parameters.getAccumulatorThreshold();
+      // Ordering controls whether intermediate compact sketches are ordered in set operations
+      _intermediateOrdering = parameters.getIntermediateOrdering();
+      // Nominal entries controls sketch accuracy and size
       int nominalEntries = parameters.getNominalEntries();
       _updateSketchBuilder.setNominalEntries(nominalEntries);
       _setOperationBuilder.setNominalEntries(nominalEntries);
+      // Sampling probability sets the initial value of Theta, defaults to 1.0
+      float p = parameters.getSamplingProbability();
+      _setOperationBuilder.setP(p);
+      _updateSketchBuilder.setP(p);
+      // Resize factor controls the size multiple that affects how fast the internal cache grows
+      ResizeFactor rf = parameters.getResizeFactor();

Review Comment:
   Curious how this is tuned and impacts performance ? 



##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountRawThetaSketchAggregationFunction.java:
##########
@@ -47,11 +49,18 @@ public ColumnDataType getFinalResultColumnType() {
   }
 
   @Override
-  public String extractFinalResult(List<Sketch> sketches) {
-    Sketch sketch = evaluatePostAggregationExpression(sketches);
+  public String extractFinalResult(List<ThetaSketchAccumulator> accumulators) {
+    int numAccumulators = accumulators.size();
+    List<Sketch> mergedSketches = new ArrayList<>(numAccumulators);
 
-    // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-    //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-    return Base64.getEncoder().encodeToString(sketch.compact(false, null).toByteArray());
+    for (ThetaSketchAccumulator accumulator : accumulators) {
+      accumulator.setOrdered(_intermediateOrdering);
+      accumulator.setThreshold(_accumulatorThreshold);
+      accumulator.setSetOperationBuilder(_setOperationBuilder);
+      mergedSketches.add(accumulator.getResult());

Review Comment:
   What would happen in the default case ? I assume it would just do union a for every sketch pair, instead of the intermediate batching (default threshold = 2).



##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunction.java:
##########
@@ -102,9 +111,22 @@ public DistinctCountThetaSketchAggregationFunction(List<ExpressionContext> argum
       Preconditions.checkArgument(paramsExpression.getType() == ExpressionContext.Type.LITERAL,
           "Second argument of DISTINCT_COUNT_THETA_SKETCH aggregation function must be literal (parameters)");
       Parameters parameters = new Parameters(paramsExpression.getLiteral().getStringValue());
+      // Allows the user to trade-off memory usage for merge CPU; higher values use more memory
+      _accumulatorThreshold = parameters.getAccumulatorThreshold();
+      // Ordering controls whether intermediate compact sketches are ordered in set operations
+      _intermediateOrdering = parameters.getIntermediateOrdering();
+      // Nominal entries controls sketch accuracy and size
       int nominalEntries = parameters.getNominalEntries();
       _updateSketchBuilder.setNominalEntries(nominalEntries);
       _setOperationBuilder.setNominalEntries(nominalEntries);
+      // Sampling probability sets the initial value of Theta, defaults to 1.0

Review Comment:
   Is this sampling to skip entries in a sketch, while doing the sort/merge? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee merged PR #12042:
URL: https://github.com/apache/pinot/pull/12042


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1900222025

   Hi @jackjlli,
   Apologies for blocking your deployment - I am looking at a fix to introduce backward compatibility in the protocol messages between server and broker.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1900702821

   @snleee @swaminathanmanish I have created the following draft PR to address the problem reported by @jackjlli:
   https://github.com/apache/pinot/pull/12288
   
   Please let me know what you think.  I successfully reproduced the problem in a local cluster and corrected it with the PR fix.  I have assumed that this only applies to the Combine phase (merge) function as per:
   https://docs.pinot.apache.org/developers/developers-and-contributors/extending-pinot/custom-aggregation-function


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408425307


##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunction.java:
##########
@@ -102,9 +111,22 @@ public DistinctCountThetaSketchAggregationFunction(List<ExpressionContext> argum
       Preconditions.checkArgument(paramsExpression.getType() == ExpressionContext.Type.LITERAL,
           "Second argument of DISTINCT_COUNT_THETA_SKETCH aggregation function must be literal (parameters)");
       Parameters parameters = new Parameters(paramsExpression.getLiteral().getStringValue());
+      // Allows the user to trade-off memory usage for merge CPU; higher values use more memory
+      _accumulatorThreshold = parameters.getAccumulatorThreshold();
+      // Ordering controls whether intermediate compact sketches are ordered in set operations
+      _intermediateOrdering = parameters.getIntermediateOrdering();
+      // Nominal entries controls sketch accuracy and size
       int nominalEntries = parameters.getNominalEntries();
       _updateSketchBuilder.setNominalEntries(nominalEntries);
       _setOperationBuilder.setNominalEntries(nominalEntries);
+      // Sampling probability sets the initial value of Theta, defaults to 1.0
+      float p = parameters.getSamplingProbability();
+      _setOperationBuilder.setP(p);
+      _updateSketchBuilder.setP(p);
+      // Resize factor controls the size multiple that affects how fast the internal cache grows
+      ResizeFactor rf = parameters.getResizeFactor();

Review Comment:
   The resize factor trades off memory consumption for performance.  Sometimes, the internal hash table of the sketch reaches capacity and must be resized which is an expensive operation.  All existing keys need to rehashed.  
   Using a different resize factor can control how large the initial hash table is on construction, thereby trading off the cost of a resize operation with potential over-allocation of memory on heap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414931382


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();

Review Comment:
   I was simply commenting that
   ```
   boolean shouldOrder = value.isOrdered();
   if (shouldCompact) {
     return value.compact(shouldOrder, null).toByteArray();
   }
   ```
   is the same as
   ```
   if (shouldCompact) {
     return value.compact(value.isOrdered(), null).toByteArray();
   }
   ```
   Since `shouldCompact` was not being used in this function. If `value.isOrdered()` is provided by the end user to indicate that we should sort, we can keep it as is. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414167043


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java:
##########
@@ -98,9 +98,9 @@ public static class Helix {
     public static final int DEFAULT_HYPERLOGLOG_PLUS_P = 14;
     public static final int DEFAULT_HYPERLOGLOG_PLUS_SP = 0;
 
-    // 2 to the power of 16, for tradeoffs see datasketches library documentation:
+    // 2 to the power of 14, for tradeoffs see datasketches library documentation:
     // https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html
-    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 65536;
+    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 16384;

Review Comment:
   Is this configurable? 



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -261,6 +261,11 @@ public static Sketch thetaSketchDiff(Object sketchObjectA, Object sketchObjectB)
     return diff.getResult(false, null, false);
   }
 
+  @ScalarFunction(names = {"thetaSketchToString", "theta_sketch_to_string"})
+  public static String thetaSketchToString(Object sketchObject) {

Review Comment:
   What do we return for this function? Can we add some examples for commit notes?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {
+      unionAll();
+    }
+  }
+
+  private Sketch unionAll() {
+    if (_union == null) {
+      _union = _setOperationBuilder.buildUnion();
+    }
+    // Return the default update "gadget" sketch as a compact sketch
+    if (isEmpty()) {
+      return _union.getResult(_ordered, null);
+    }
+    // Corner-case: the parameters are not strictly respected when there is a single sketch.
+    // This single sketch might have been the result of a previously accumulated union and
+    // would already have the parameters set.  The sketch is returned as-is without adjusting
+    // ordering and nominal entries which requires an additional union operation.
+    if (_numInputs == 1) {
+      return _accumulator.get(0);
+    }
+
+    // Performance optimisation: ensure that the minimum Theta is used for "early-stop".
+    _accumulator.sort(Comparator.comparingDouble(Sketch::getTheta));

Review Comment:
   I also have the question as I don't see the code for early terminating the merge. I would assume that this is happening within `Union` or in some sketch side code the sorted property of sketches will trigger the early stop?
   
   I think that it will be great if we can capture the above comment to the comment in the code for providing more insights on this optimization.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();

Review Comment:
   Shouldn't we order if the value is not ordered?
   `boolean shouldOrder = value.isOrdered() -> !value.isOrdered()`?
   
   Current code just passes `shouldOrder` to compact(). In that case, we can just pass `compact(value.isOrdered(), null)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408421227


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -261,6 +261,11 @@ public static Sketch thetaSketchDiff(Object sketchObjectA, Object sketchObjectB)
     return diff.getResult(false, null, false);
   }
 
+  @ScalarFunction(names = {"thetaSketchToString", "theta_sketch_to_string"})

Review Comment:
   No, these are useful for debugging purposes however.  They can be used to inspect the preamble flags for sorting and compaction for sketches in storage, or, for sketches that are returned by the aggregation function.



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -261,6 +261,11 @@ public static Sketch thetaSketchDiff(Object sketchObjectA, Object sketchObjectB)
     return diff.getResult(false, null, false);
   }
 
+  @ScalarFunction(names = {"thetaSketchToString", "theta_sketch_to_string"})

Review Comment:
   No, these are useful for debugging purposes.  They can be used to inspect the preamble flags for sorting and compaction for sketches in storage, or, for sketches that are returned by the aggregation function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1408414659


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/customobject/ThetaSketchAccumulator.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.customobject;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import javax.annotation.Nonnull;
+import org.apache.datasketches.theta.SetOperationBuilder;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+
+
+/**
+ * Intermediate state used by {@code DistinctCountThetaSketchAggregationFunction} which gives
+ * the end user more control over how sketches are merged for performance.
+ * The end user can set parameters that trade-off more memory usage for more pre-aggregation.
+ * This permits use of the Union "early-stop" optimisation where ordered sketches require no further
+ * processing beyond the minimum Theta value.
+ * The union operation initialises an empty "gadget" bookkeeping sketch that is updated with hashed entries
+ * that fall below the minimum Theta value for all input sketches ("Broder Rule").  When the initial
+ * Theta value is set to the minimum immediately, further gains can be realised.
+ */
+public class ThetaSketchAccumulator {
+  private ArrayList<Sketch> _accumulator;
+  private boolean _ordered = false;
+  private SetOperationBuilder _setOperationBuilder = new SetOperationBuilder();
+  private Union _union;
+  private int _threshold;
+  private int _numInputs = 0;
+
+  public ThetaSketchAccumulator() {
+  }
+
+  // Note: The accumulator is serialized as a sketch.  This means that the majority of the processing
+  // happens on serialization. Therefore, when deserialized, the values may be null and will
+  // require re-initialisation. Since the primary use case is at query time for the Broker
+  // and Server, these properties are already in memory and are re-set.
+  public ThetaSketchAccumulator(SetOperationBuilder setOperationBuilder, boolean ordered, int threshold) {
+    _setOperationBuilder = setOperationBuilder;
+    _ordered = ordered;
+    _threshold = threshold;
+  }
+
+  public void setOrdered(boolean ordered) {
+    _ordered = ordered;
+  }
+
+  public void setSetOperationBuilder(SetOperationBuilder setOperationBuilder) {
+    _setOperationBuilder = setOperationBuilder;
+  }
+
+  public void setThreshold(int threshold) {
+    _threshold = threshold;
+  }
+
+  public boolean isEmpty() {
+    return _numInputs == 0;
+  }
+
+  @Nonnull
+  public Sketch getResult() {
+    return unionAll();
+  }
+
+  public void apply(Sketch sketch) {
+    internalAdd(sketch);
+  }
+
+  public void merge(ThetaSketchAccumulator thetaUnion) {
+    if (thetaUnion.isEmpty()) {
+      return;
+    }
+    Sketch sketch = thetaUnion.getResult();
+    internalAdd(sketch);
+  }
+
+  private void internalAdd(Sketch sketch) {
+    if (sketch.isEmpty()) {
+      return;
+    }
+    if (_accumulator == null) {
+      _accumulator = new ArrayList<>(_threshold);
+    }
+    _accumulator.add(sketch);
+    _numInputs += 1;
+
+    if (_accumulator.size() >= _threshold) {
+      unionAll();
+    }
+  }
+
+  private Sketch unionAll() {
+    if (_union == null) {
+      _union = _setOperationBuilder.buildUnion();
+    }
+    // Return the default update "gadget" sketch as a compact sketch
+    if (isEmpty()) {
+      return _union.getResult(_ordered, null);
+    }
+    // Corner-case: the parameters are not strictly respected when there is a single sketch.
+    // This single sketch might have been the result of a previously accumulated union and
+    // would already have the parameters set.  The sketch is returned as-is without adjusting
+    // ordering and nominal entries which requires an additional union operation.
+    if (_numInputs == 1) {
+      return _accumulator.get(0);
+    }
+
+    // Performance optimisation: ensure that the minimum Theta is used for "early-stop".
+    _accumulator.sort(Comparator.comparingDouble(Sketch::getTheta));

Review Comment:
   Internally, an compact and ordered Theta sketch can be compared to a sorted array of K items, where K is often in the order of thousands.
   
   To merge several sketches, the union operator will retain the items from each array which then need to be de-duplicated and stored where the item is less than the threshold, Theta.  If the Theta is chosen to be the minimum of those arrays, it means that it only has to process up to Min Theta for each sketch in the merge, thus reducing de-duplication costs and potentially thousands of hash collision checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "swaminathanmanish (via GitHub)" <gi...@apache.org>.
swaminathanmanish commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1839183839

   > > Thanks for the detailed description and the reasoning behind this approach. Intuitively this approach make sense. I took a first pass of the PR and have some high level questions
   > > 
   > > 1. There are 4 new params introduced. Have we quantified the gains for each of these params and which one yields the largest gains? Im assuming these params work independent of each other.
   > > 2. We have nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented tuning this param to figure out the gains ?  How can this parameter impact performance.
   > > 3. Could you share the benchmark results/numbers for different values of the params
   > 
   > Thank you for your review @swaminathanmanish. I found your questions very insightful. I should make it clear at the outset that it has been difficult to thoroughly benchmark the work in this pull request in a test environment. Some of the performance improvements are knobs to turn and are speculative in how they could behave.
   > 
   > Therefore I would like to propose that the parameters are retained and tested in a production environment, and then subsequently removed should they show no user benefit. The downside to this approach is that backward compatibility will not be maintained should end users start to depend on them and use them. As for your questions, answers follow inline.
   > 
   > > 1. There are 4 new params introduced. Have we quantified the gains for each of these params and which one yields the largest gains? Im assuming these params work independent of each other.
   > 
   > The largest gain can be realised through adjusting the sampling probability parameter. However, this is highly dependent on the use case and should only be used in certain circumstances. All parameters are independent of each other and have been selected to have default behaviour that retains the existing behaviour of the system. The performance gains measured in testing are between 25% and 50% performance improvement. Where the results are sampled, the speed increases by 300%.
   > 
   > > 2. We have nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented tuning this param to figure out the gains ?  How can this parameter impact performance.
   > 
   > We have been using this extensively, but it does not help where there is a large tail of sketches that have retained items less than nominal entries. For these cases, selecting lower nominal entries can impact the error across the board. Instead, using sampling probability allows the end user to curtail size and increase error on the long tail, which, for some reports and queries, is a good tradeoff.
   > 
   > > 3. Could you share the benchmark results/numbers for different values of the params
   > 
   > I'd like to do more testing in a production environment. But, the default set of parameters can be up to 25% faster than the current implementation (through sorting and retaining unions). However, adjusting others such as sampling has shown orders of magnitude gains - but, there are certain tradeoffs on accuracy that need to be considered.
   
   Thanks for the detailed response @davecromberge 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1415143704


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/CustomSerDeUtils.java:
##########
@@ -243,9 +243,14 @@ public TDigest deserialize(ByteBuffer byteBuffer) {
 
     @Override
     public byte[] serialize(Sketch value) {
-      // NOTE: Compact the sketch in unsorted, on-heap fashion for performance concern.
-      //       See https://datasketches.apache.org/docs/Theta/ThetaSize.html for more details.
-      return value.compact(false, null).toByteArray();
+      // The serializer should respect existing ordering to enable "early stop"
+      // optimisations on unions.
+      boolean shouldCompact = !value.isCompact();
+      boolean shouldOrder = value.isOrdered();

Review Comment:
   Yes, the ordering is set by the end user.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Theta Sketch Aggregation Enhancements [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #12042:
URL: https://github.com/apache/pinot/pull/12042#discussion_r1414232324


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -261,6 +261,11 @@ public static Sketch thetaSketchDiff(Object sketchObjectA, Object sketchObjectB)
     return diff.getResult(false, null, false);
   }
 
+  @ScalarFunction(names = {"thetaSketchToString", "theta_sketch_to_string"})
+  public static String thetaSketchToString(Object sketchObject) {

Review Comment:
   This function returns a string which is a summary of the current sketch state.  Note, the function name is consistent with the same name as that in Druid.  See [this link](https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta#sketch-summary).
   
   Examples include:
   
   ```
   ### HeapCompactOrderedSketch SUMMARY: 
      Estimate                : 149586.73149344584
      Upper Bound, 95% conf   : 154287.5017892762
      Lower Bound, 95% conf   : 145028.6046846571
      Theta (double)          : 0.027382107751846067
      Theta (long)            : 252555366948521403
      Theta (long) hex        : 038141c4a515c5bb
      EstMode?                : true
      Empty?                  : false
      Array Size Entries      : 4096
      Retained Entries        : 4096
      Seed Hash               : 93cc
   ### END SKETCH SUMMARY
   ```
   and:
   ```
   ### HeapCompactOrderedSketch SUMMARY: 
      Estimate                : 48249.113729035394
      Upper Bound, 95% conf   : 50358.736970106176
      Lower Bound, 95% conf   : 46227.35737896924
      Theta (double)          : 0.04377282475820978
      Theta (long)            : 403733047849016500
      Theta (long) hex        : 059a591165205cb4
      EstMode?                : true
      Empty?                  : false
      Array Size Entries      : 2112
      Retained Entries        : 2112
      Seed Hash               : 93cc
   ### END SKETCH SUMMARY
   ```
   
   These allow the end user to inspect whether a sketch is in estimation mode, or whether the sketch is ordered, or even the state of the sketch - whether undateable or compact.
   
   Where is the best place to include these examples for the commit notes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org