Posted to commits@pinot.apache.org by "andimiller (via GitHub)" <gi...@apache.org> on 2023/02/27 15:26:11 UTC

[GitHub] [pinot] andimiller opened a new pull request, #10347: Add Sketch Creation Scalar Functions for HLL/Theta

andimiller opened a new pull request, #10347:
URL: https://github.com/apache/pinot/pull/10347

   This enables creation of `BYTES` columns containing Theta/HLL sketches during ingestion.
   
   Note that I've added these in `pinot-core` because I wasn't sure whether the datasketches/clearspring dependencies should be added to `pinot-common`.
   
   Can be activated like so:
   
   ```json
   {
     "transformConfigs": [
       {
         "columnName": "players",
         "transformFunction": "DistinctCountRawThetaSketch(playerID)"
       },
       {
         "columnName": "names",
         "transformFunction": "DistinctCountRawHLL(playerName)"
       }
     ]
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "mayankshriv (via GitHub)" <gi...@apache.org>.
mayankshriv commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1447321571

   Thanks for your contribution, @andimiller. One thing to share from past experience: creating a sketch/HLL per value may bloat data size. Nonetheless, it does seem like a nice feature to have.
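
To put a rough number on the bloat concern, here is a back-of-the-envelope sketch in plain Java. It assumes the second argument in the later `DistinctCountRawHLL(playerName, 8)` example is `log2m`, and it assumes a dense layout with one byte per register. Both are assumptions made for illustration: real libraries pack registers more tightly and may use sparse encodings, so actual serialized sizes will differ. `SketchSizeEstimate` is a hypothetical helper, not part of Pinot.

```java
/** Rough size estimate for a per-row HLL; the one-byte-per-register layout is an assumption. */
final class SketchSizeEstimate {
  private SketchSizeEstimate() {
  }

  /** Number of HLL registers for a given log2m: m = 2^log2m. */
  static int registerCount(int log2m) {
    if (log2m < 0 || log2m > 30) {
      throw new IllegalArgumentException("log2m out of range: " + log2m);
    }
    return 1 << log2m;
  }

  /** Bytes for a dense sketch under the assumed layout of one byte per register. */
  static long denseBytes(int log2m) {
    return registerCount(log2m);
  }

  /** How many times larger the per-row sketch is than the raw value it replaces. */
  static double bloatFactor(int log2m, int rawValueBytes) {
    return (double) denseBytes(log2m) / rawValueBytes;
  }

  public static void main(String[] args) {
    System.out.println(registerCount(8));  // prints 256
    System.out.println(bloatFactor(8, 8)); // prints 32.0 (versus an 8-byte LONG)
  }
}
```

Under these assumptions a single-value sketch is tens of times larger than the raw value it wraps, which is why the feature pays off only once rows are rolled up.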




[GitHub] [pinot] andimiller commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1446716812

   > Thank you for the contribution. Could we add some tests?
   
   Tests with 100% coverage have been added.




[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119006797


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .build();
+    if (input instanceof String) {
+      sketch.update((String) input);
+    } else if (input instanceof Long) {
+      sketch.update((Long) input);
+    } else if (input instanceof Integer) {
+      sketch.update((Integer) input);
+    } else if (input instanceof BigDecimal) {
+      sketch.update(((BigDecimal) input).toString());
+    } else if (input instanceof Float) {
+      sketch.update((Float) input);
+    } else if (input instanceof Double) {
+      sketch.update((Double) input);
+    } else if (input instanceof byte[]) {
+      sketch.update((byte[]) input);
+    }
+    return ObjectSerDeUtils.DATA_SKETCH_SER_DE.serialize(sketch.compact());

Review Comment:
   The intention of this PR is indeed to produce one result per row, which can then be rolled up using #10328.
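
The "one sketch per row, roll up later" idea can be illustrated with a toy example. This is a minimal sketch, assuming a simplified HyperLogLog register array in plain Java: `ToyHllRollup`, its hash function, and the fixed `LOG2M` are hypothetical stand-ins, not the clearspring or Pinot code. It shows that merging single-item sketches (register-wise maximum) yields the same registers as feeding every value into one sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Toy HyperLogLog registers; NOT the clearspring implementation used by the PR. */
final class ToyHllRollup {
  static final int LOG2M = 8; // assumed value: 2^8 = 256 registers

  private ToyHllRollup() {
  }

  /** Stand-in 64-bit hash (FNV-1a plus a mixing finalizer); real libraries use their own. */
  static long hash(String value) {
    long h = 0xcbf29ce484222325L;
    for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x100000001b3L;
    }
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    return h;
  }

  /** Adds one value: the top LOG2M bits pick a register, the remaining bits give the rank. */
  static void add(byte[] registers, String value) {
    if (value == null) {
      return; // null leaves the sketch empty, mirroring the nullable parameter in the PR
    }
    long h = hash(value);
    int index = (int) (h >>> (Long.SIZE - LOG2M));
    int rank = Math.min(Long.numberOfLeadingZeros(h << LOG2M) + 1, Long.SIZE - LOG2M + 1);
    registers[index] = (byte) Math.max(registers[index], rank);
  }

  /** What the scalar function does conceptually: one sketch holding one value. */
  static byte[] toSketch(String value) {
    byte[] registers = new byte[1 << LOG2M];
    add(registers, value);
    return registers;
  }

  /** Merge is a register-wise maximum. */
  static byte[] merge(byte[] a, byte[] b) {
    byte[] out = a.clone();
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Math.max(out[i], b[i]);
    }
    return out;
  }

  /** Rollup: merge the per-row sketches into one. */
  static byte[] rollUp(List<String> rows) {
    byte[] acc = new byte[1 << LOG2M];
    for (String row : rows) {
      acc = merge(acc, toSketch(row));
    }
    return acc;
  }

  /** Baseline: add every value to a single sketch. */
  static byte[] buildDirectly(List<String> rows) {
    byte[] registers = new byte[1 << LOG2M];
    for (String row : rows) {
      add(registers, row);
    }
    return registers;
  }

  static int nonZeroRegisters(byte[] registers) {
    int count = 0;
    for (byte r : registers) {
      if (r != 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> rows = List.of("alice", "bob", "alice", "carol");
    System.out.println(Arrays.equals(rollUp(rows), buildDirectly(rows))); // prints true
  }
}
```

Each per-row sketch carries a full register array with exactly one non-zero entry, so the result is correct but the storage cost is paid on every row until the rollup happens.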





[GitHub] [pinot] cbalci commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1446914313

   Creating one sketch per row seems a bit wasteful. If the sole purpose is to enable rollup aggregation, can we implement this logic in the `aggregate` method (https://github.com/apache/pinot/pull/10328) by doing some type checking?




[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119344492


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   Renamed to `toThetaSketch` and `toHLL`.



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)

Review Comment:
   added



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    return distinctCountRawThetaSketch(input, CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES);
+  }
+
+  /**
+   * Create a Theta Sketch containing the input, with a configured nominal entries
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @param nominalEntries number of nominal entries the sketch is configured to keep
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input, int nominalEntries) {
+    UpdateSketch sketch = Sketches.updateSketchBuilder().setNominalEntries(nominalEntries).build();
+    if (input instanceof String) {

Review Comment:
   done





[GitHub] [pinot] walterddr commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119014136


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   By the way, is this entry even designed to be configurable, given that we always add only one value to the sketch (same as HLL)?
   
   - This means we pre-determine the storage/accuracy trade-off during ingestion, yes?





[GitHub] [pinot] walterddr commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119072554


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Hmm, because these configs can be changed after the table is created, that could lead to the error/exception you mentioned above. However, we still want to make this configurable on a per-column basis rather than relying on a table-wide or cluster-wide setting.
   
   Several possible "solutions" here:
   1. Throw at merge time: do not allow merging two incompatible sketches.
       - e.g. only allow merges for data ingested with the same config.
   2. Create a table config validator that rejects changes to the sketch config afterwards.
       - e.g. changes to this config after table creation are not allowed.
   3. Regenerate all the derived columns when this config changes.
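
Option 2 could look roughly like the following. This is a minimal sketch in plain Java under stated assumptions: `SketchTransformValidator` and its methods are hypothetical and not an existing Pinot validator, the transform configs are reduced to a column-name to transform-function map, and the function names are the ones that appear in this thread (before and after the rename).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical validator that rejects edits to sketch-producing transforms. */
final class SketchTransformValidator {
  // Function names taken from this PR thread; the list itself is an assumption.
  private static final List<String> SKETCH_FUNCTIONS =
      List.of("distinctcountrawthetasketch", "distinctcountrawhll", "tothetasketch", "tohll");

  private SketchTransformValidator() {
  }

  /** True if the transform expression starts with one of the sketch function names. */
  static boolean isSketchTransform(String transformFunction) {
    String normalized = transformFunction.trim().toLowerCase(Locale.ROOT);
    for (String name : SKETCH_FUNCTIONS) {
      if (normalized.startsWith(name + "(")) {
        return true;
      }
    }
    return false;
  }

  /** Returns the columns whose existing sketch transform was changed or removed. */
  static List<String> changedSketchColumns(Map<String, String> existing, Map<String, String> proposed) {
    List<String> changed = new ArrayList<>();
    for (Map.Entry<String, String> entry : existing.entrySet()) {
      String oldFunction = entry.getValue();
      if (isSketchTransform(oldFunction) && !oldFunction.equals(proposed.get(entry.getKey()))) {
        changed.add(entry.getKey());
      }
    }
    Collections.sort(changed);
    return changed;
  }

  /** Throws if the proposed config alters any sketch transform. */
  static void validateUpdate(Map<String, String> existing, Map<String, String> proposed) {
    List<String> changed = changedSketchColumns(existing, proposed);
    if (!changed.isEmpty()) {
      throw new IllegalStateException("Sketch transform changed for columns: " + changed);
    }
  }
}
```

Comparing the expressions as strings is deliberately conservative: it also rejects a change that only spells out a default parameter explicitly. New sketch columns are still allowed because only existing entries are checked.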





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119042127


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Also done: split so they can be configured, and updated the usage examples.





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119032910


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   1. Yes, this pre-determines the storage/accuracy trade-off during ingestion.
   2. Generally, your `log2m` and `nominalEntries` should be the same throughout the system, since merging two sketches with different values will introduce a larger error margin. I believe this HyperLogLog library will throw an exception at merge time, but the Theta library will allow the merge.
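
To illustrate why mismatched sizes cost accuracy, here is a toy K-Minimum-Values sketch in plain Java. It is only a sketch under stated assumptions: it is not the DataSketches or Clearspring implementation, and the names `KmvSketchDemo`, `Kmv`, `update`, and `merge` are hypothetical. In this toy model a merge can only retain as many hashes as the smaller sketch holds, so the merged estimate carries the error margin of the smaller configuration.

```java
import java.util.TreeSet;

/** Toy K-Minimum-Values distinct-count sketch; illustration only, not a library implementation. */
final class KmvSketchDemo {

  static final class Kmv {
    final int k;                                 // retained hashes, loosely analogous to nominal entries
    final TreeSet<Long> mins = new TreeSet<>();  // the k smallest hashes seen so far

    Kmv(int k) {
      this.k = k;
    }

    /** SplitMix64 finalizer, shifted to a non-negative 63-bit value. */
    static long hash(long x) {
      x += 0x9E3779B97F4A7C15L;
      x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
      x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
      return (x ^ (x >>> 31)) >>> 1;
    }

    void update(long item) {
      addHash(hash(item));
    }

    void addHash(long h) {
      mins.add(h);
      if (mins.size() > k) {
        mins.pollLast();  // drop the largest hash so only the k smallest remain
      }
    }

    double estimate() {
      if (mins.size() < k) {
        return mins.size();  // fewer than k distinct hashes seen: the count is exact
      }
      double kthFraction = mins.last() / (double) Long.MAX_VALUE;
      return (k - 1) / kthFraction;
    }

    /** Merging is always allowed here, but the result keeps only min(a.k, b.k) hashes. */
    static Kmv merge(Kmv a, Kmv b) {
      Kmv out = new Kmv(Math.min(a.k, b.k));
      for (long h : a.mins) {
        out.addHash(h);
      }
      for (long h : b.mins) {
        out.addHash(h);
      }
      return out;
    }
  }

  public static void main(String[] args) {
    Kmv big = new Kmv(1024);
    Kmv small = new Kmv(64);
    for (long id = 0; id < 5000; id++) {
      big.update(id);           // ids 0..4999
      small.update(id + 2500);  // ids 2500..7499, so the true union holds 7500 distinct ids
    }
    Kmv merged = Kmv.merge(big, small);
    System.out.println("merged k = " + merged.k);
    System.out.println("merged estimate = " + merged.estimate());
  }
}
```

The relative error of a KMV estimate shrinks roughly with the square root of `k`, which is why keeping one value throughout the system avoids silently degrading to the smallest configuration.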





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119076777


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Option 1 is probably the best solution here.
   
   I'm not sure option 3 is viable: the usual reason for generating sketches is to throw away the original data and compact rows to save space, so changing the value would mean backfilling the data.





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119044434


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   I'm not sure whether the project has a recommended memoization library anywhere, or whether memoizing here would be a good idea.
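
If memoization were tried here, a small bounded cache built from the standard library would avoid adding a dependency. The following is a minimal sketch under assumptions, not something from the PR: `LruMemoizer` and `LruMemoizerDemo` are hypothetical names, and the lambda is a stand-in for building and serializing a sketch. It relies only on `LinkedHashMap` in access order with `removeEldestEntry`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Bounded LRU memoizer for a pure function that returns serialized bytes (hypothetical helper). */
final class LruMemoizer<K> {
  private final Map<K, byte[]> cache;
  private final Function<K, byte[]> compute;

  LruMemoizer(int maxEntries, Function<K, byte[]> compute) {
    this.compute = compute;
    // accessOrder=true makes iteration order least-recently-used first;
    // the synchronized wrapper is needed because get() reorders entries.
    this.cache = Collections.synchronizedMap(new LinkedHashMap<K, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, byte[]> eldest) {
        return size() > maxEntries;
      }
    });
  }

  /** Returns a copy so callers cannot mutate the cached array. The function must not return null. */
  byte[] get(K key) {
    byte[] cached = cache.get(key);
    if (cached == null) {
      // Two threads may compute the same key concurrently; harmless for a pure function.
      cached = compute.apply(key);
      cache.put(key, cached);
    }
    return cached.clone();
  }
}

final class LruMemoizerDemo {
  public static void main(String[] args) {
    AtomicInteger computations = new AtomicInteger();
    LruMemoizer<String> memo = new LruMemoizer<>(2, value -> {
      computations.incrementAndGet();
      return value.getBytes(StandardCharsets.UTF_8);  // stand-in for sketch creation
    });
    memo.get("alice");
    memo.get("alice");  // served from the cache
    memo.get("bob");
    System.out.println("computations = " + computations.get());
  }
}
```

Whether this pays off depends on how often input values repeat; for a high-cardinality column most lookups would miss, and the cache would only add overhead.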





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119076777


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Option 1 is probably the best solution here.
   
   I'm not sure option 3 is viable, because the usual reason for generating sketches is to throw away the original data and compact rows to save space.





[GitHub] [pinot] walterddr commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1447112086

    
   > I can't find a mechanism that would allow changing the column type from eg STRING to BYTES during aggregation
   
   This makes me think the real problem to solve is allowing `agg(CAST(strCol AS BYTES))`. Can I assume this ingestion change is no longer needed if we support the above?
   
   > 
   > I guess the ideal would be to have different column types between REALTIME and OFFLINE and swap to sketches when moving between tables
   
   




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119324503


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   We cannot use the same name as the aggregation function because it will cause a name conflict. I suggest renaming it to `toThetaSketch()`.



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)

Review Comment:
   Do we allow null input here? If so, let's annotate the input as `@Nullable`.





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119336035


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   It does actually work with the same name, but I agree a different name would be clearer.





[GitHub] [pinot] andimiller commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1447122486

   > > I can't find a mechanism that would allow changing the column type from eg STRING to BYTES during aggregation
   > 
   > This prompts me to think the real problem to solve is to allow `agg(CAST(strCol AS BYTES))`, can I assume this change in ingestion is no longer needed if we have support for the above?
   
   To clarify, I meant during rollup aggregations, where we turn multiple rows into one, because sketches are usually used to compact rows.
   
   It is still desirable to create sketches during ingestion: it reduces the size of the data stored in Pinot and removes the need to preprocess the data (create sketches and roll them up) before pushing to Pinot.




[GitHub] [pinot] Jackie-Jiang merged pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang merged PR #10347:
URL: https://github.com/apache/pinot/pull/10347




[GitHub] [pinot] walterddr commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119014136


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Btw, is this nominal entries value even designed to be configurable, given that we always add only one value to the sketch (same as HLL)?
   
   - This means we pre-determine the storage/accuracy trade-off during ingestion, yes?
   - What happens if, some time down the line, we change the nominal entries value and some previously ingested data used a different configuration? Will the merge/rollup still work?





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119041194


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   if they'd like to change the `log2m` or `nominalEntries`, caching would get a bit awkward
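
To illustrate the concern: once the parameter can vary, a single constant empty sketch no longer suffices and the cache has to be keyed by that parameter. The following is a minimal sketch, not code from the PR. `EmptySketchCache`, `emptyFor` and `buildEmpty` are hypothetical names, and `buildEmpty` only stands in for building and serializing a real empty sketch.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class EmptySketchCache {
  // One cached byte[] per distinct parameter value (hypothetical design, not the PR's).
  private static final Map<Integer, byte[]> CACHE = new ConcurrentHashMap<>();
  // Counts how often the expensive build path runs; only here to make the caching visible.
  static final AtomicInteger BUILDS = new AtomicInteger();

  private EmptySketchCache() {
  }

  // Stand-in for "build an empty sketch with this parameter and serialize it".
  private static byte[] buildEmpty(int nominalEntries) {
    BUILDS.incrementAndGet();
    return ByteBuffer.allocate(Integer.BYTES).putInt(nominalEntries).array();
  }

  // Returns a defensive copy so callers cannot mutate the cached bytes.
  static byte[] emptyFor(int nominalEntries) {
    byte[] cached = CACHE.computeIfAbsent(nominalEntries, EmptySketchCache::buildEmpty);
    return cached.clone();
  }
}
```

The defensive copy adds an allocation per call, so whether such a cache actually pays off for single-item sketches is an open question rather than something this example settles.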





[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119006797


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .build();
+    if (input instanceof String) {
+      sketch.update((String) input);
+    } else if (input instanceof Long) {
+      sketch.update((Long) input);
+    } else if (input instanceof Integer) {
+      sketch.update((Integer) input);
+    } else if (input instanceof BigDecimal) {
+      sketch.update(((BigDecimal) input).toString());
+    } else if (input instanceof Float) {
+      sketch.update((Float) input);
+    } else if (input instanceof Double) {
+      sketch.update((Double) input);
+    } else if (input instanceof byte[]) {
+      sketch.update((byte[]) input);
+    }
+    return ObjectSerDeUtils.DATA_SKETCH_SER_DE.serialize(sketch.compact());

Review Comment:
   the intention of this PR is indeed to make one result per row, which can then be rolled up using #10328 or ingestion aggregations
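
The flow of one sketch per row followed by a roll-up can be pictured with a toy model. This is only an illustration under stated assumptions: a `BitSet` stands in for a real Theta/HLL sketch, and `RowSketchDemo`, `rowSketch` and `rollUp` are hypothetical names, not Pinot or datasketches APIs.

```java
import java.util.BitSet;
import java.util.List;

final class RowSketchDemo {
  static final int BITS = 1024;  // toy size parameter, chosen arbitrarily

  private RowSketchDemo() {
  }

  // One summary per row: a null input yields an empty summary, anything else sets exactly one bit.
  static BitSet rowSketch(Object input) {
    BitSet sketch = new BitSet(BITS);
    if (input != null) {
      sketch.set(Math.floorMod(input.hashCode(), BITS));
    }
    return sketch;
  }

  // Roll-up: merging is a bitwise OR, so repeated values collapse into the same bit.
  static BitSet rollUp(List<BitSet> rows) {
    BitSet merged = new BitSet(BITS);
    for (BitSet row : rows) {
      merged.or(row);
    }
    return merged;
  }
}
```

In this toy, two different inputs can land on the same bit; a real sketch accounts for such collisions statistically, which is why the per-row step only needs to record a single item.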





[GitHub] [pinot] walterddr commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1118974664


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {

Review Comment:
   should you null-check the input and return a constant empty sketch?



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   suggest making nominal entries configurable by adding an overload:
   ```
     @ScalarFunction(nullableParameters = true)
     public static byte[] distinctCountRawThetaSketch(Object input) {
       return distinctCountRawThetaSketch(input, CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES);
     }
     
     @ScalarFunction(nullableParameters = true)
     public static byte[] distinctCountRawThetaSketch(Object input, int nominalEntries) {
       ..
     }
   ```
   



##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)
+            .build();
+    if (input instanceof String) {
+      sketch.update((String) input);
+    } else if (input instanceof Long) {
+      sketch.update((Long) input);
+    } else if (input instanceof Integer) {
+      sketch.update((Integer) input);
+    } else if (input instanceof BigDecimal) {
+      sketch.update(((BigDecimal) input).toString());
+    } else if (input instanceof Float) {
+      sketch.update((Float) input);
+    } else if (input instanceof Double) {
+      sketch.update((Double) input);
+    } else if (input instanceof byte[]) {
+      sketch.update((byte[]) input);
+    }
+    return ObjectSerDeUtils.DATA_SKETCH_SER_DE.serialize(sketch.compact());

Review Comment:
   i am no expert in the hll or theta sketch algos, but doesn't this always create one result per row? is there really a need for "update"? or is that the intention of this PR?





[GitHub] [pinot] codecov-commenter commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1446665537

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#10347](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (40214fb) into [master](https://codecov.io/gh/apache/pinot/commit/69d2fae2b4c8388a9ec7a559deb07ea2e62d8f26?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (69d2fae) will **increase** coverage by `38.19%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #10347       +/-   ##
   =============================================
   + Coverage     32.06%   70.25%   +38.19%     
   - Complexity      236     5901     +5665     
   =============================================
     Files          2027     2028        +1     
     Lines        109954   109975       +21     
     Branches      16711    16719        +8     
   =============================================
   + Hits          35256    77263    +42007     
   + Misses        71552    27289    -44263     
   - Partials       3146     5423     +2277     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `24.53% <0.00%> (+0.16%)` | :arrow_up: |
   | integration2 | `24.54% <0.00%> (?)` | |
   | unittests1 | `67.67% <0.00%> (?)` | |
   | unittests2 | `13.75% <0.00%> (-0.02%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...he/pinot/core/function/scalar/SketchFunctions.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9mdW5jdGlvbi9zY2FsYXIvU2tldGNoRnVuY3Rpb25zLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...core/startree/operator/StarTreeFilterOperator.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9zdGFydHJlZS9vcGVyYXRvci9TdGFyVHJlZUZpbHRlck9wZXJhdG9yLmphdmE=) | `86.18% <0.00%> (-0.66%)` | :arrow_down: |
   | [...pache/pinot/core/query/utils/idset/EmptyIdSet.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS91dGlscy9pZHNldC9FbXB0eUlkU2V0LmphdmE=) | `25.00% <0.00%> (ø)` | |
   | [...anager/realtime/SegmentBuildTimeLeaseExtender.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL21hbmFnZXIvcmVhbHRpbWUvU2VnbWVudEJ1aWxkVGltZUxlYXNlRXh0ZW5kZXIuamF2YQ==) | `63.23% <0.00%> (ø)` | |
   | [.../core/realtime/PinotLLCRealtimeSegmentManager.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9oZWxpeC9jb3JlL3JlYWx0aW1lL1Bpbm90TExDUmVhbHRpbWVTZWdtZW50TWFuYWdlci5qYXZh) | `75.41% <0.00%> (+0.45%)` | :arrow_up: |
   | [...lix/core/realtime/PinotRealtimeSegmentManager.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9oZWxpeC9jb3JlL3JlYWx0aW1lL1Bpbm90UmVhbHRpbWVTZWdtZW50TWFuYWdlci5qYXZh) | `79.50% <0.00%> (+0.50%)` | :arrow_up: |
   | [...che/pinot/broker/routing/BrokerRoutingManager.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9Ccm9rZXJSb3V0aW5nTWFuYWdlci5qYXZh) | `85.79% <0.00%> (+0.55%)` | :arrow_up: |
   | [...ces/PinotSegmentUploadDownloadRestletResource.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9hcGkvcmVzb3VyY2VzL1Bpbm90U2VnbWVudFVwbG9hZERvd25sb2FkUmVzdGxldFJlc291cmNlLmphdmE=) | `54.77% <0.00%> (+0.82%)` | :arrow_up: |
   | [...e/pinot/common/function/TransformFunctionType.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vZnVuY3Rpb24vVHJhbnNmb3JtRnVuY3Rpb25UeXBlLmphdmE=) | `100.00% <0.00%> (+0.94%)` | :arrow_up: |
   | [.../helix/core/realtime/SegmentCompletionManager.java](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9oZWxpeC9jb3JlL3JlYWx0aW1lL1NlZ21lbnRDb21wbGV0aW9uTWFuYWdlci5qYXZh) | `73.17% <0.00%> (+1.01%)` | :arrow_up: |
   | ... and [1258 more](https://codecov.io/gh/apache/pinot/pull/10347?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   




[GitHub] [pinot] andimiller commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119339764


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)

Review Comment:
   yes, because even null values need to be turned into an empty sketch; I will add the annotations





[GitHub] [pinot] walterddr commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119072554


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * These are intended to be used during ingestion to create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    // TODO make nominal entries configurable
+    UpdateSketch sketch =
+        Sketches.updateSketchBuilder().setNominalEntries(CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES)

Review Comment:
   Hmm, because these configs can be modified after the table is created, changing them could lead to the error/exception you mentioned above. However, we still want to make this configurable on a per-column basis rather than relying on a table-wide or cluster-wide setting.
   
   Several possible "solutions" here:
   1. Throw when merging two incompatible sketches.
       - e.g. only allow merging data that was ingested with the same config.
   2. Add a table config validator that rejects changes to the sketch config.
       - e.g. changes to this config are not allowed after table creation.
   3. Regenerate all the derived columns when this config changes.
   
   ^ BTW the above is not related to this PR; it is more of a follow-up we need to think about.
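
Options 1 and 2 above could look roughly like the following. This is only a minimal sketch, assuming the per-column setting is a single integer (for example, theta nominal entries). `SketchConfigGuard`, `checkMergeable`, and `validateUpdate` are hypothetical names for illustration and are not part of Pinot.

```java
import java.util.Map;

// Hypothetical helper, not a Pinot class: illustrates options 1 and 2 from the list above.
final class SketchConfigGuard {
  private SketchConfigGuard() {
  }

  // Option 1: refuse to merge two sketches that were built with different settings.
  static void checkMergeable(int leftNominalEntries, int rightNominalEntries) {
    if (leftNominalEntries != rightNominalEntries) {
      throw new IllegalStateException(
          "Cannot merge sketches built with different nominal entries: " + leftNominalEntries + " vs "
              + rightNominalEntries);
    }
  }

  // Option 2: reject a config update that changes the setting of an existing sketch column.
  // Assumption for this sketch: adding a new column or dropping one is still allowed.
  static void validateUpdate(Map<String, Integer> existing, Map<String, Integer> updated) {
    for (Map.Entry<String, Integer> entry : existing.entrySet()) {
      Integer newValue = updated.get(entry.getKey());
      if (newValue != null && !newValue.equals(entry.getValue())) {
        throw new IllegalArgumentException(
            "Sketch config for column '" + entry.getKey() + "' cannot change after table creation: "
                + entry.getValue() + " -> " + newValue);
      }
    }
  }
}
```

Option 3 (regenerating derived columns) is not covered here, since it depends on having the raw input available at reload time.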





[GitHub] [pinot] andimiller commented on pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "andimiller (via GitHub)" <gi...@apache.org>.
andimiller commented on PR #10347:
URL: https://github.com/apache/pinot/pull/10347#issuecomment-1446918974

   > Creating one sketch per row seems a bit wasteful. If the sole purpose is to enable rollup aggregation, can we implement this logic in the `aggregate` method (#10328) by doing some type checking?
   
   I can't find a mechanism that would allow changing the column type from, e.g., STRING to BYTES during aggregation.
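
To illustrate the type constraint, here is a minimal stand-in, assuming a rollup aggregate must take and return values of the column's declared type. It does not use a real sketch: a newline-separated list of distinct values plays the role of the serialized sketch, and values are assumed to contain no newlines. `RollupTypeStandIn` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeSet;

// Hypothetical stand-in, not a Pinot class and not a real sketch.
final class RollupTypeStandIn {
  private RollupTypeStandIn() {
  }

  // Ingestion-time transform: STRING in, BYTES out (a "sketch" holding a single item).
  static byte[] toSingleItemBytes(String raw) {
    return raw.getBytes(StandardCharsets.UTF_8);
  }

  // Rollup aggregate: BYTES in, BYTES out, so the column type never changes here.
  static byte[] merge(byte[] left, byte[] right) {
    TreeSet<String> distinct = new TreeSet<>();
    addAll(distinct, left);
    addAll(distinct, right);
    return String.join("\n", distinct).getBytes(StandardCharsets.UTF_8);
  }

  // Number of distinct values held in the serialized stand-in.
  static int distinctCount(byte[] serialized) {
    TreeSet<String> distinct = new TreeSet<>();
    addAll(distinct, serialized);
    return distinct.size();
  }

  private static void addAll(TreeSet<String> target, byte[] serialized) {
    String text = new String(serialized, StandardCharsets.UTF_8);
    if (!text.isEmpty()) {
      for (String value : text.split("\n")) {
        target.add(value);
      }
    }
  }
}
```

The STRING to BYTES conversion happens once in the transform, and every later merge only combines BYTES with BYTES.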




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #10347: Add Sketch Creation Scalar Functions for HLL/Theta

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #10347:
URL: https://github.com/apache/pinot/pull/10347#discussion_r1119326294


##########
pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java:
##########
@@ -0,0 +1,133 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.function.scalar;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import java.math.BigDecimal;
+import org.apache.datasketches.theta.Sketches;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.pinot.core.common.ObjectSerDeUtils;
+import org.apache.pinot.spi.annotations.ScalarFunction;
+import org.apache.pinot.spi.utils.CommonConstants;
+
+
+/**
+ * Inbuilt Sketch Transformation Functions
+ * The functions can be used as UDFs in Query when added in the FunctionRegistry.
+ * @ScalarFunction annotation is used with each method for the registration
+ *
+ * Note these will just make sketches that contain a single item, these are intended to be used during ingestion to
+ * create sketches from raw data, which can be rolled up later.
+ *
+ * Note this is defined in pinot-core rather than pinot-common because pinot-core has dependencies on
+ * datasketches/clearspring analytics.
+ *
+ * Example usage:
+ *
+ * {
+ *   "transformConfigs": [
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID)"
+ *     },
+ *     {
+ *       "columnName": "players",
+ *       "transformFunction": "DistinctCountRawThetaSketch(playerID, 1024)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName)"
+ *     },
+ *     {
+ *       "columnName": "names",
+ *       "transformFunction": "DistinctCountRawHLL(playerName, 8)"
+ *     }
+ *   ]
+ * }
+ */
+public class SketchFunctions {
+  private SketchFunctions() {
+  }
+
+  /**
+   * Create a Theta Sketch containing the input
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input) {
+    return distinctCountRawThetaSketch(input, CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES);
+  }
+
+  /**
+   * Create a Theta Sketch containing the input, with a configured nominal entries
+   *
+   * @param input an Object we want to insert into the sketch, may be null to return an empty sketch
+   * @param nominalEntries number of nominal entries the sketch is configured to keep
+   * @return serialized theta sketch as bytes
+   */
+  @ScalarFunction(nullableParameters = true)
+  public static byte[] distinctCountRawThetaSketch(Object input, int nominalEntries) {
+    UpdateSketch sketch = Sketches.updateSketchBuilder().setNominalEntries(nominalEntries).build();
+    if (input instanceof String) {

Review Comment:
   (minor) Suggest following the same order as the `DataType` definition so that the branches are easier to track: INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING, BYTES.
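
The suggested ordering could be laid out as in the following sketch. It only returns a label per branch instead of updating a sketch, so it makes no claim about the PR's actual method body. `DataTypeOrderDispatch` and `classify` are hypothetical names, and the `NULL` label is an assumption standing in for the "null returns an empty sketch" case.

```java
import java.math.BigDecimal;

// Hypothetical illustration of the branch order only; not the PR's implementation.
final class DataTypeOrderDispatch {
  private DataTypeOrderDispatch() {
  }

  // Checks follow the order INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING, BYTES.
  static String classify(Object input) {
    if (input == null) {
      return "NULL";
    }
    if (input instanceof Integer) {
      return "INT";
    } else if (input instanceof Long) {
      return "LONG";
    } else if (input instanceof Float) {
      return "FLOAT";
    } else if (input instanceof Double) {
      return "DOUBLE";
    } else if (input instanceof BigDecimal) {
      return "BIG_DECIMAL";
    } else if (input instanceof String) {
      return "STRING";
    } else if (input instanceof byte[]) {
      return "BYTES";
    }
    throw new IllegalArgumentException("Unsupported input type: " + input.getClass().getName());
  }
}
```

Because the boxed types here do not overlap, reordering the branches changes readability only, not behavior.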


