You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "andimiller (via GitHub)" <gi...@apache.org> on 2023/02/15 22:18:49 UTC

[GitHub] [pinot] andimiller opened a new pull request, #10288: Add Theta Sketch Aggregation

andimiller opened a new pull request, #10288:
URL: https://github.com/apache/pinot/pull/10288

   # Summary
   
   This implements `ValueAggregator` for the apache datasketches theta `Sketch`.
   
   This should allow `DISTINCT_COUNT_THETA_SKETCH` and `DISTINCT_COUNT_RAW_THETA_SKETCH` to be used in `StarTree` indexes.
   
   # Assumptions:
   * The default number of nominal entries is `2 ^ 16`, I haven't made this configurable because I wasn't sure which mechanism to use
   * the `_maxByteSize` implementation is similar to the `Bitmap` implementation since it can grow a lot
   * unit tests have been added to cover the main methods


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] davecromberge commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108706772


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }
+
+  @Override
+  public Sketch getInitialAggregatedValue(Object rawValue) {
+    Sketch initialValue;
+    if (rawValue instanceof byte[]) {
+      byte[] bytes = (byte[]) rawValue;
+      initialValue = deserializeAggregatedValue(bytes);
+      _maxByteSize = Math.max(_maxByteSize, bytes.length);
+    } else {
+      initialValue = singleItemSketch(rawValue);
+      _maxByteSize = Math.max(_maxByteSize, initialValue.getCurrentBytes(true));
+    }
+    return initialValue;

Review Comment:
   Fair enough



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "mayankshriv (via GitHub)" <gi...@apache.org>.
mayankshriv commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108948702


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java:
##########
@@ -91,6 +91,10 @@ public static class Helix {
     public static final String DEFAULT_HYPERLOGLOG_LOG2M_KEY = "default.hyperloglog.log2m";
     public static final int DEFAULT_HYPERLOGLOG_LOG2M = 8;
 
+    // 2 to the power of 16, for tradeoffs see datasketches library documentation:
+    // https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html
+    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 65536;

Review Comment:
   We might want to make this configurable to allow for accuracy vs storage tradeoff.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }
+
+  @Override
+  public Sketch getInitialAggregatedValue(Object rawValue) {
+    Sketch initialValue;
+    if (rawValue instanceof byte[]) {
+      byte[] bytes = (byte[]) rawValue;
+      initialValue = deserializeAggregatedValue(bytes);
+      _maxByteSize = Math.max(_maxByteSize, bytes.length);
+    } else {
+      initialValue = singleItemSketch(rawValue);
+      _maxByteSize = Math.max(_maxByteSize, initialValue.getCurrentBytes(true));
+    }
+    return initialValue;

Review Comment:
   Yeah, this is more of a contract than enforcement. 



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,138 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder()
+            .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .buildUnion();
+  };
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch = Sketches.updateSketchBuilder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .build();
+    if (rawValue instanceof String) {

Review Comment:
   There are other data types (eg `Boolean`, `JSON`) etc that might need to be handled. Seems like other valueAggregators have the same issue.
   
   cc: @Jackie-Jiang 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433895884

   found it, once I added a `functionColumnPairs` of `"DISTINCT_COUNT_RAW_THETA_SKETCH__Column` the index accelerates the raw version


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433839206

   > Thanks for the contribution!
   > 
   > I do have a high level question: do you need to use theta sketch to do single column distinct count? The reason why we didn't add theta sketch support for star-tree index is because HLL is the better data structure to use if no set intersect or diff is required
   
   You're correct, we want to query Theta sketches from the StarTree index and do set operations with them, so I'm hoping this will allow queries like this to hit the StarTree index:
   
   ```sql
   SELECT DistinctCountThetaSketch(
     users,
     '',
     'country ="UK"',
     'device="Browser"',
     'SET_DIFF($1, $2)'
   ) AS british_non_browser_users FROM user_stats
   ```
   
   where `users` is a `BYTES` column with serialized theta sketches in
   
   and if it can't handle that we're hoping we can fall back to:
   
   ```sql
   SELECT 
     DistinctCountRawThetaSketch(users) where country="UK" 
   FROM user_stats
   ```
   ```sql
   SELECT 
     DistinctCountRawThetaSketch(users) where device="Browser"
   FROM user_stats
   ```
   
   then do the diff in a JVM program
   
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] davecromberge commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108528024


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();

Review Comment:
   **suggestion:** You could also construct a shared union builder up front and apply the stateless operation here:
   
   ```u.union(left, right)```
   
   This would automatically return a compacted sketch as result.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }
+
+  @Override
+  public Sketch getInitialAggregatedValue(Object rawValue) {
+    Sketch initialValue;
+    if (rawValue instanceof byte[]) {
+      byte[] bytes = (byte[]) rawValue;
+      initialValue = deserializeAggregatedValue(bytes);
+      _maxByteSize = Math.max(_maxByteSize, bytes.length);
+    } else {
+      initialValue = singleItemSketch(rawValue);
+      _maxByteSize = Math.max(_maxByteSize, initialValue.getCurrentBytes(true));
+    }
+    return initialValue;

Review Comment:
   **question:** if the end user attempts to build this index on a non-theta serialised byte column, would it blow up?  Would you say there is a potential improvement to be made at a system level in relaying types for these values?



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import java.util.stream.IntStream;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.spi.utils.CommonConstants;
+import org.testng.annotations.Test;
+
+import static org.testng.Assert.assertEquals;
+
+public class DistinctCountThetaSketchValueAggregatorTest {
+
+
+    @Test
+    public void initialShouldCreateSingleItemSketch() {
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        assertEquals(
+                agg.getInitialAggregatedValue("hello world").getEstimate(),
+                1.0
+        );
+    }
+
+    @Test
+    public void initialShouldParseASketch() {
+        UpdateSketch input = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input::update);
+        Sketch result = input.compact();
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        byte[] bytes = agg.serializeAggregatedValue(result);
+        assertEquals(
+                agg.getInitialAggregatedValue(bytes).getEstimate(),
+                result.getEstimate()
+        );
+
+        // and should update the max size
+        assertEquals(
+                agg.getMaxAggregatedValueByteSize(),
+                result.getCurrentBytes(true)
+        );
+    }
+
+    @Test
+    public void applyAggregatedValueShouldUnion() {
+        UpdateSketch input1 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input1::update);
+        Sketch result1 = input1.compact();
+        UpdateSketch input2 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input2::update);
+        Sketch result2 = input1.compact();
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        Sketch result = agg.applyAggregatedValue(result1, result2);
+        Union union = Union.builder()
+          .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+        union.update(result1);
+        union.update(result2);
+
+        assertEquals(
+                result.getEstimate(),
+                union.getResult().getEstimate()
+        );
+
+        // and should update the max size
+        assertEquals(
+                agg.getMaxAggregatedValueByteSize(),
+                union.getResult().getCurrentBytes(true)
+        );
+    }
+
+    @Test
+    public void applyRawValueShouldUnion() {
+        UpdateSketch input1 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input1::update);
+        Sketch result1 = input1.compact();
+        UpdateSketch input2 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input2::update);
+        Sketch result2 = input1.compact();

Review Comment:
   **question:** did you mean `input2.compact()`?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();

Review Comment:
   **nitpick:** there is no need to compact here because the single item sketch is already a compact sketch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang merged pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang merged PR #10288:
URL: https://github.com/apache/pinot/pull/10288


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108546415


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import java.util.stream.IntStream;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.spi.utils.CommonConstants;
+import org.testng.annotations.Test;
+
+import static org.testng.Assert.assertEquals;
+
+public class DistinctCountThetaSketchValueAggregatorTest {
+
+
+    @Test
+    public void initialShouldCreateSingleItemSketch() {
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        assertEquals(
+                agg.getInitialAggregatedValue("hello world").getEstimate(),
+                1.0
+        );
+    }
+
+    @Test
+    public void initialShouldParseASketch() {
+        UpdateSketch input = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input::update);
+        Sketch result = input.compact();
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        byte[] bytes = agg.serializeAggregatedValue(result);
+        assertEquals(
+                agg.getInitialAggregatedValue(bytes).getEstimate(),
+                result.getEstimate()
+        );
+
+        // and should update the max size
+        assertEquals(
+                agg.getMaxAggregatedValueByteSize(),
+                result.getCurrentBytes(true)
+        );
+    }
+
+    @Test
+    public void applyAggregatedValueShouldUnion() {
+        UpdateSketch input1 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input1::update);
+        Sketch result1 = input1.compact();
+        UpdateSketch input2 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input2::update);
+        Sketch result2 = input1.compact();
+        DistinctCountThetaSketchValueAggregator agg = new DistinctCountThetaSketchValueAggregator();
+        Sketch result = agg.applyAggregatedValue(result1, result2);
+        Union union = Union.builder()
+          .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+        union.update(result1);
+        union.update(result2);
+
+        assertEquals(
+                result.getEstimate(),
+                union.getResult().getEstimate()
+        );
+
+        // and should update the max size
+        assertEquals(
+                agg.getMaxAggregatedValueByteSize(),
+                union.getResult().getCurrentBytes(true)
+        );
+    }
+
+    @Test
+    public void applyRawValueShouldUnion() {
+        UpdateSketch input1 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input1::update);
+        Sketch result1 = input1.compact();
+        UpdateSketch input2 = Sketches.updateSketchBuilder().build();
+        IntStream.range(0, 1000).forEach(input2::update);
+        Sketch result2 = input1.compact();

Review Comment:
   I did, that is a mistake, good catch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1113804834


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,138 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder()
+            .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .buildUnion();
+  };
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch = Sketches.updateSketchBuilder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .build();
+    if (rawValue instanceof String) {

Review Comment:
   Should be okay as other types are stored as the basic types



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108697064


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch = Sketches.updateSketchBuilder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .build();
+    if (rawValue instanceof String) {
+      sketch.update((String) rawValue);
+    } else if (rawValue instanceof Integer) {
+      sketch.update((Integer) rawValue);
+    } else if (rawValue instanceof Long) {
+      sketch.update((Long) rawValue);
+    } else if (rawValue instanceof Double) {
+      sketch.update((Double) rawValue);
+    } else if (rawValue instanceof Float) {
+      sketch.update((Float) rawValue);
+    } else {
+      sketch.update(rawValue.hashCode());
+    }
+    return sketch.compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }

Review Comment:
   we verified that they're ordered on the updated dependency



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1109635053


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   I tried asking intellij to reformat the project based on the config and it's rewritten 1062 files, I'll try and narrow it down to just files I've touched



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1114358463


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,134 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+  }
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .build();
+    if (rawValue instanceof String) {
+      sketch.update((String) rawValue);
+    } else if (rawValue instanceof Integer) {
+      sketch.update((Integer) rawValue);
+    } else if (rawValue instanceof Long) {
+      sketch.update((Long) rawValue);
+    } else if (rawValue instanceof Double) {
+      sketch.update((Double) rawValue);
+    } else if (rawValue instanceof Float) {
+      sketch.update((Float) rawValue);
+    } else {
+      sketch.update(rawValue.hashCode());

Review Comment:
   I've added MV support and fallen through to an `IllegalStateException`, plus tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108695746


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();

Review Comment:
   I pulled in the updated datasketches and fixed this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] davecromberge commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108554042


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch = Sketches.updateSketchBuilder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .build();
+    if (rawValue instanceof String) {
+      sketch.update((String) rawValue);
+    } else if (rawValue instanceof Integer) {
+      sketch.update((Integer) rawValue);
+    } else if (rawValue instanceof Long) {
+      sketch.update((Long) rawValue);
+    } else if (rawValue instanceof Double) {
+      sketch.update((Double) rawValue);
+    } else if (rawValue instanceof Float) {
+      sketch.update((Float) rawValue);
+    } else {
+      sketch.update(rawValue.hashCode());
+    }
+    return sketch.compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }

Review Comment:
   **suggestion:** It could be worth ensuring that sketches are ordered.  This would make unions more efficient, and, assuming that they are computed at each level of the tree multiple traversals would not repeatedly incur the cost of aligning hash keys for unordered unions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433784353

   Thanks for the contribution!
   
   I do have a high level question: do you need to use theta sketch to do single column distinct count? The reason why we didn't add theta sketch support for star-tree index is because HLL is the better data structure to use if no set intersect or diff is required


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1110154082


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   @andimiller Oh you can just apply that to the file that you are touching
   
   you can even highlight the section and run `shft+alt+cmd + L` on mac and the following window will pop up.
   
   ![Screenshot 2023-02-17 at 9 56 49 AM](https://user-images.githubusercontent.com/27253407/219738841-4535c6af-a37f-46ae-8e1d-0f54ca5a6ac5.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1113810827


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,134 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;

Review Comment:
   (minor)
   ```suggestion
     private final Union _union;
   ```



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,134 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+  }
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .build();
+    if (rawValue instanceof String) {
+      sketch.update((String) rawValue);
+    } else if (rawValue instanceof Integer) {
+      sketch.update((Integer) rawValue);
+    } else if (rawValue instanceof Long) {
+      sketch.update((Long) rawValue);
+    } else if (rawValue instanceof Double) {
+      sketch.update((Double) rawValue);
+    } else if (rawValue instanceof Float) {
+      sketch.update((Float) rawValue);
+    } else {
+      sketch.update(rawValue.hashCode());

Review Comment:
   We should do an `instanceof Object[]` check to handle MV case. If it is not `Object[]`, throw an `IllegalStateException` to indicate that the type is unsupported



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,134 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+  }
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;

Review Comment:
   (minor)
   Move this in front of the constructor for readability



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433105164

   the errors look like something is pulling in an old version of `datasketches`, but I'm not sure where


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108896630


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   Can we apply the formatting for this? https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup#setup-ide



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   @Jackie-Jiang This somehow passed our lint test. Can you take a look if we cover `tab` correctly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433947288

   `DISTINCT_COUNT_THETA_SKETCH` is a very special aggregation function because it can work on multiple columns at once (similar to post aggregation but modeled as aggregation). Star-tree index can be applied only when the aggregation is applied on a single column.
   I can see the value of supporting it for `DISTINCT_COUNT_RAW_THETA_SKETCH`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433420880

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#10288](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (890a04b) into [master](https://codecov.io/gh/apache/pinot/commit/a0162559b5dfbf5ac56c342afd5d576532b3dc3d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a016255) will **decrease** coverage by `56.69%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #10288       +/-   ##
   =============================================
   - Coverage     70.36%   13.68%   -56.69%     
   + Complexity     5827      182     -5645     
   =============================================
     Files          2016     1963       -53     
     Lines        109635   107220     -2415     
     Branches      16680    16392      -288     
   =============================================
   - Hits          77145    14669    -62476     
   - Misses        27073    91387    +64314     
   + Partials       5417     1164     -4253     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `?` | |
   | unittests2 | `13.68% <0.00%> (-0.02%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...gator/DistinctCountThetaSketchValueAggregator.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9hZ2dyZWdhdG9yL0Rpc3RpbmN0Q291bnRUaGV0YVNrZXRjaFZhbHVlQWdncmVnYXRvci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...gment/local/aggregator/ValueAggregatorFactory.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9hZ2dyZWdhdG9yL1ZhbHVlQWdncmVnYXRvckZhY3RvcnkuamF2YQ==) | `0.00% <0.00%> (-84.62%)` | :arrow_down: |
   | [...va/org/apache/pinot/spi/utils/CommonConstants.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdXRpbHMvQ29tbW9uQ29uc3RhbnRzLmphdmE=) | `0.00% <ø> (-24.40%)` | :arrow_down: |
   | [...src/main/java/org/apache/pinot/sql/FilterKind.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcWwvRmlsdGVyS2luZC5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...ain/java/org/apache/pinot/core/data/table/Key.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL3RhYmxlL0tleS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...in/java/org/apache/pinot/spi/utils/BytesUtils.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdXRpbHMvQnl0ZXNVdGlscy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...ain/java/org/apache/pinot/common/CustomObject.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vQ3VzdG9tT2JqZWN0LmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...n/java/org/apache/pinot/core/data/table/Table.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL3RhYmxlL1RhYmxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/core/data/table/Record.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL3RhYmxlL1JlY29yZC5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/core/util/GroupByUtils.java](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS91dGlsL0dyb3VwQnlVdGlscy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | ... and [1588 more](https://codecov.io/gh/apache/pinot/pull/10288?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1114358114


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,134 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.Union;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  Union _union;
+
+  public DistinctCountThetaSketchValueAggregator() {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    _union = Union.builder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES).buildUnion();
+  }
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;

Review Comment:
   moved



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1108696627


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregator.java:
##########
@@ -0,0 +1,117 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;
+
+import org.apache.datasketches.theta.SingleItemSketch;
+import org.apache.datasketches.theta.Sketch;
+import org.apache.datasketches.theta.Union;
+import org.apache.pinot.segment.local.utils.CustomSerDeUtils;
+import org.apache.pinot.segment.spi.AggregationFunctionType;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+public class DistinctCountThetaSketchValueAggregator implements ValueAggregator<Object, Sketch> {
+  public static final DataType AGGREGATED_VALUE_TYPE = DataType.BYTES;
+
+  @Override
+  public AggregationFunctionType getAggregationType() {
+    return AggregationFunctionType.DISTINCTCOUNTTHETASKETCH;
+  }
+
+  @Override
+  public DataType getAggregatedValueType() {
+    return AGGREGATED_VALUE_TYPE;
+  }
+
+  // This changes a lot similar to the Bitmap aggregator
+  private int _maxByteSize;
+
+  // Utility method to create a theta sketch with one item in it
+  private Sketch singleItemSketch(Object rawValue) {
+    return SingleItemSketch.create(rawValue.hashCode()).compact();
+  }
+
+  // Utility method to merge two sketches
+  private Sketch union(Sketch left, Sketch right) {
+    // TODO: Handle configurable nominal entries for StarTreeBuilder
+    Union u = Union.builder()
+      .setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+      .buildUnion();
+    u.update(left);
+    u.update(right);
+    return u.getResult().compact();
+  }
+
+  @Override
+  public Sketch getInitialAggregatedValue(Object rawValue) {
+    Sketch initialValue;
+    if (rawValue instanceof byte[]) {
+      byte[] bytes = (byte[]) rawValue;
+      initialValue = deserializeAggregatedValue(bytes);
+      _maxByteSize = Math.max(_maxByteSize, bytes.length);
+    } else {
+      initialValue = singleItemSketch(rawValue);
+      _maxByteSize = Math.max(_maxByteSize, initialValue.getCurrentBytes(true));
+    }
+    return initialValue;

Review Comment:
   it would, and I've copied how the other aggregations do it, I'm not sure what the right way to deal with this is



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10288:
URL: https://github.com/apache/pinot/pull/10288#issuecomment-1433889181

   I tested it locally and it looks like today the first SQL query doesn't trigger the StarTree index, but the raw ones do


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1110154082


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   @andimiller Oh you can just apply that to the file that you are touching
   
   you can even highlight the section and run `shft+alt+cmd + L` on mac and the following window will pop up.
   
   ![Screenshot 2023-02-17 at 9 57 57 AM](https://user-images.githubusercontent.com/27253407/219739667-afbb84e5-8f4d-4cf3-8285-65099823e72d.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] andimiller commented on a diff in pull request #10288: Add Theta Sketch Aggregation for StarTree Indexes

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10288:
URL: https://github.com/apache/pinot/pull/10288#discussion_r1110168491


##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/aggregator/DistinctCountThetaSketchValueAggregatorTest.java:
##########
@@ -0,0 +1,136 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.aggregator;

Review Comment:
   thanks, I went through earlier and reformatted each file in this PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org