You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/05 02:04:46 UTC

[GitHub] [druid] gianm opened a new pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

gianm opened a new pull request #11201:
URL: https://github.com/apache/druid/pull/11201


   Builds on the concept from #11172 and adds a way to feed HLL sketches
   with UTF-8 bytes.
   
   This must be an option rather than always-on, because prior to this
   patch, HLL sketches used UTF-16LE encoding when hashing strings. To
   remain compatible with sketch images created prior to this patch -- which
   matters during rolling updates and when reading sketches that have been
   written to segments -- we must keep UTF-16LE as the default.
   
   Not currently documented, because I'm not yet sure how best to expose
   this functionality to users. I think the first place would be in the SQL
   layer: we could have it automatically select UTF-8 or UTF-16LE when
   building sketches at query time. We need to be careful about this, though,
   because UTF-8 isn't always faster. Sometimes, like for the results of
   expressions, UTF-16LE is faster. I expect we will sort this out in
   future patches.
   
   **Benchmarks**
   
   These are benchmarks of `APPROX_COUNT_DISTINCT_DS_HLL(<string column>)` on a 500k row segment. The values that appear are generally short (a few characters long).
   
   ```
   patch - utf8
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  27.639 ± 1.221  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  25.362 ± 2.406  ms/op
   
   patch - utf16le
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  40.828 ± 2.855  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  39.257 ± 5.548  ms/op
   
   master
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  43.704 ± 3.875  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  45.293 ± 3.045  ms/op
   ```
   
   **Testing strategy**
   
   - Extended HllSketchAggregatorTest to run on a matrix of (vectorized, nonvectorized) x (utf8, utf16le).
   - Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions.
   - Added serde tests to make sure the new field JSON-round-trips properly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-833134173


   This pull request **introduces 1 alert** when merging 0d7d9aad781528b09013c91bc95699f011f314a7 into 34169c8550dbe0a3fc9043aa21d221b9c7502039 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-9a9edf1c6824d39cdea78a09323d5ca30ace44b8)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-832932241


   This pull request **introduces 1 alert** when merging 492edd34f67378656d64c260e1b46cb8a320ce29 into 34169c8550dbe0a3fc9043aa21d221b9c7502039 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-1209ba4123cc1a9ac6ce24b1b9e3f553c37c56a2)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #11201:
URL: https://github.com/apache/druid/pull/11201#discussion_r734994198



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/SingleValueStringHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class SingleValueStringHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final SingleValueDimensionVectorSelector selector;
+
+  public SingleValueStringHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final SingleValueDimensionVectorSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final int[] vector = selector.getRowVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      HllSketchBuildUtil.updateSketchWithDictionarySelector(
+          sketch,
+          stringEncoding,
+          selector,
+          vector[i]
+      );
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final int[] vector = selector.getRowVector();
+
+    for (int i = 0; i < numRows; i++) {
+      final int position = positions[i] + positionOffset;
+      final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+      HllSketchBuildUtil.updateSketchWithDictionarySelector(
+          sketch,
+          stringEncoding,
+          selector,
+          vector[i]

Review comment:
       Fixed and added a test case using filtered aggregators.

##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/ObjectHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.VectorObjectSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+/**
+ * Processor that handles cases where string columns are presented as object selectors instead of dimension selectors.
+ */
+public class ObjectHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final VectorObjectSelector selector;
+
+  public ObjectHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final VectorObjectSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final Object[] vector = selector.getObjectVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (vector[i] != null) {
+        HllSketchBuildUtil.updateSketch(
+            sketch,
+            stringEncoding,
+            vector[i]
+        );
+      }
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final Object[] vector = selector.getObjectVector();

Review comment:
       Fixed and added a test case using filtered aggregators.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-950240994


   This pull request **introduces 1 alert** when merging f7ff8d4c9e80492bf7b43ecc6cbd15da5cd1efbd into 44a7b091908f8602d28ce4bffe791764e0ee57ad - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-ed5d6dbc9b7849702c702034475bd606ead8fe8a)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-950188697


   Pushed this stuff:
   
   - fixes for the issues @abhishekagarwal87 found
   - update to use the new API from https://github.com/apache/datasketches-java/pull/353
   - improve test coverage for updateSketch and updateSketchWithDictionarySelector


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] leerho commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-834830853


   @gianm 
   I want to clarify a comment made above.
   
   > HLL sketches used UTF-16LE encoding when hashing strings. 
   
   This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does).  Strings are encoded using UTF-8 and have been for as long as I can remember.  If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well.  The sketch really doesn't care what the string encoding is, it is either looking at the input as a stream of byte[] or char[].   The UTF-8 encoding was specified in the string update method to help users ensure consistency (if the string happened to be encoded in something else).  Nonetheless, whatever you decide, you will **always** need to stick with your choice.  Otherwise, you will destroy the unique identity of whatever you are feeding the sketch. As a result counts, merging, etc will be meaningless!
   
   I have some comments about [PR 353](https://github.com/apache/datasketches-java/pull/353) but I want to make these in the actual PR.
   
   Lee.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-832391515


   This pull request **introduces 1 alert** when merging e27cc1cb3de02dec5dadec0e398b445a7132045f into 99f39c720261923a4f908ad2ae33b03b372989b0 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-acf9bcb2b1ce37a0bd063cd4bbc72d760af276f3)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-950197682


   This pull request **introduces 1 alert** when merging 53a708fa3f29d517260f954c87288da934dd9222 into 44a7b091908f8602d28ce4bffe791764e0ee57ad - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-4412e2a209b2937b6739c846074477ad13d02eba)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #11201:
URL: https://github.com/apache/druid/pull/11201#discussion_r637966467



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/ObjectHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.VectorObjectSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+/**
+ * Processor that handles cases where string columns are presented as object selectors instead of dimension selectors.
+ */
+public class ObjectHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final VectorObjectSelector selector;
+
+  public ObjectHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final VectorObjectSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final Object[] vector = selector.getObjectVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (vector[i] != null) {
+        HllSketchBuildUtil.updateSketch(
+            sketch,
+            stringEncoding,
+            vector[i]
+        );
+      }
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final Object[] vector = selector.getObjectVector();

Review comment:
       Ah, good catch, I'll figure out why the tests didn't catch it. I think we might need to use a filtered aggregator.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-836992606


   > This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well.
   
   @leerho Understood, but it is true as far as Druid is concerned — the HllSketch-based aggregator implementation in Druid does `update(s.toCharArray())` not `update(s)`: https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/HllSketchBuildAggregator.java#L103
   
   >  Nonetheless, whatever you decide, you will always need to stick with your choice.
   
   Yep, that's why this must be an option and the choice needs to be made in a consistent way.
   
   > I have some comments about PR 353 but I want to make these in the actual PR.
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #11201:
URL: https://github.com/apache/druid/pull/11201#discussion_r637707374



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/ObjectHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.VectorObjectSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+/**
+ * Processor that handles cases where string columns are presented as object selectors instead of dimension selectors.
+ */
+public class ObjectHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final VectorObjectSelector selector;
+
+  public ObjectHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final VectorObjectSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final Object[] vector = selector.getObjectVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (vector[i] != null) {
+        HllSketchBuildUtil.updateSketch(
+            sketch,
+            stringEncoding,
+            vector[i]
+        );
+      }
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final Object[] vector = selector.getObjectVector();

Review comment:
       `rows` is not used in this function for finding the vector offset. 

##########
File path: core/src/main/java/org/apache/druid/java/util/common/StringEncoding.java
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.java.util.common;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonValue;
+import com.google.common.primitives.SignedBytes;
+
+/**
+ * An enum that provides a way for users to specify what encoding should be used when hashing strings.
+ *
+ * The main reason for thsi setting's existence is getting the best performance possible. When operating on memory

Review comment:
       ```suggestion
    * The main reason for this setting's existence is getting the best performance possible. When operating on memory
   ```

##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/SingleValueStringHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class SingleValueStringHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final SingleValueDimensionVectorSelector selector;
+
+  public SingleValueStringHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final SingleValueDimensionVectorSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final int[] vector = selector.getRowVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      HllSketchBuildUtil.updateSketchWithDictionarySelector(
+          sketch,
+          stringEncoding,
+          selector,
+          vector[i]
+      );
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final int[] vector = selector.getRowVector();
+
+    for (int i = 0; i < numRows; i++) {
+      final int position = positions[i] + positionOffset;
+      final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+      HllSketchBuildUtil.updateSketchWithDictionarySelector(
+          sketch,
+          stringEncoding,
+          selector,
+          vector[i]

Review comment:
       similar comment here about `rows`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-966468378


   This pull request **introduces 1 alert** when merging dc373eb9f69c5d985a802d0e7f22f91fd895876c into 223c5692a8292a210a770dc0702f75b4484ae347 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-006ae887114d0a653765fc5bfa030771d39af063)
   
   **new alerts:**
   
   * 1 for Inconsistent equals and hashCode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #11201:
URL: https://github.com/apache/druid/pull/11201#discussion_r637966916



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/vector/ObjectHllSketchBuildVectorProcessor.java
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.hll.vector;
+
+import org.apache.datasketches.hll.HllSketch;
+import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildBufferAggregatorHelper;
+import org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildUtil;
+import org.apache.druid.segment.vector.VectorObjectSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+/**
+ * Processor that handles cases where string columns are presented as object selectors instead of dimension selectors.
+ */
+public class ObjectHllSketchBuildVectorProcessor implements HllSketchBuildVectorProcessor
+{
+  private final HllSketchBuildBufferAggregatorHelper helper;
+  private final StringEncoding stringEncoding;
+  private final VectorObjectSelector selector;
+
+  public ObjectHllSketchBuildVectorProcessor(
+      final HllSketchBuildBufferAggregatorHelper helper,
+      final StringEncoding stringEncoding,
+      final VectorObjectSelector selector
+  )
+  {
+    this.helper = helper;
+    this.stringEncoding = stringEncoding;
+    this.selector = selector;
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int position, int startRow, int endRow)
+  {
+    final Object[] vector = selector.getObjectVector();
+    final HllSketch sketch = helper.getSketchAtPosition(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (vector[i] != null) {
+        HllSketchBuildUtil.updateSketch(
+            sketch,
+            stringEncoding,
+            vector[i]
+        );
+      }
+    }
+  }
+
+  @Override
+  public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
+  {
+    final Object[] vector = selector.getObjectVector();

Review comment:
       This makes me think it'd be good to have a standard aggregator test harness, like we have with BaseFilterTest for filters (that makes sure filters are exercised in all the ways they can be called).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org