You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/02/10 22:32:49 UTC

[GitHub] [pinot] Jackie-Jiang opened a new pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Jackie-Jiang opened a new pull request #8189:
URL: https://github.com/apache/pinot/pull/8189


   ## Description
   Adds `DistinctCountSmartHLLAggregationFunction` which can automatically convert the `Set` to `HyperLogLog` if the set size grows too big to protect the servers from running out of memory. This conversion only applies to aggregation only queries, but not the group-by queries.
   
   By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
   The log2m and threshold can be configured using the second argument (literal) of the function:
   - `hllLog2m`: log2m of the converted HyperLogLog (default 12)
   - `hllConversionThreshold`: set size threshold to trigger the conversion, non-positive means never convert (default 100K)
   
   Example query:
   `SELECT DISTINCTCOUNTSMARTHLL(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable`
   
   ## Release Notes
   Adds `DistinctCountSmartHLLAggregationFunction` which automatically stores distinct values in Set or HyperLogLog based on cardinality


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#issuecomment-1039660887


   @richardstartin I'd go with the set based solution for now to have the same behavior with `distinct_count` for low cardinality case. We can revisit both functions after collecting more info in the future. In the meanwhile, if user knows bitmap based solution performs better, they can use `DistinctCountBitmap` instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang merged pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang merged pull request #8189:
URL: https://github.com/apache/pinot/pull/8189


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#issuecomment-1036521820


   > @richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap gives better performance for sparse values?
   > 
   > Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current `DISTINCT_COUNT` in certain environments (configurable)
   
   The worst case depends on the size of the set. The absolute worst case is more than 32 bits per value, this would happen if you had 2^16 values with a gap of roughly 2^16 between each value in the set. The worst case for a set more than 2^16 values decreases monotonically. 
   
   If we have to maintain absolute accuracy below the threshold, we can't truncate `double` to `float`, but hopefully users don't want to distinct count `double`s anyway, and it's a meaningless operation given the nature of floating point numbers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804707995



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;
+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;

Review comment:
       Given that this is an approximate algorithm and floating point numbers are _approximations_, I don't see a downside in loss of precision by converting `double` to `float` and then storing the `int` bits in a `RoaringBitmap`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#issuecomment-1036471514


   @richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap gives better performance for sparse values?
   
   Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current `DISTINCT_COUNT` in certain environments (configurable)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804660859



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;
+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;

Review comment:
       Could convert to int/long bits and store in a roaringbitmap




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804690496



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;
+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;

Review comment:
       But this doesn't work well for `double` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#issuecomment-1035621423


   # [Codecov](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#8189](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e7c5165) into [master](https://codecov.io/gh/apache/pinot/commit/b12b7bb84d1669253a04c0a0e31f60be92f1ba2d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b12b7bb) will **decrease** coverage by `57.26%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/pinot/pull/8189/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #8189       +/-   ##
   =============================================
   - Coverage     71.34%   14.08%   -57.27%     
   + Complexity     4307       81     -4226     
   =============================================
     Files          1623     1579       -44     
     Lines         84320    82940     -1380     
     Branches      12642    12569       -73     
   =============================================
   - Hits          60162    11683    -48479     
   - Misses        20028    70388    +50360     
   + Partials       4130      869     -3261     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `?` | |
   | unittests2 | `14.08% <0.00%> (-0.09%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...ator/query/DictionaryBasedAggregationOperator.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9vcGVyYXRvci9xdWVyeS9EaWN0aW9uYXJ5QmFzZWRBZ2dyZWdhdGlvbk9wZXJhdG9yLmphdmE=) | `0.00% <0.00%> (-87.31%)` | :arrow_down: |
   | [...rg/apache/pinot/core/plan/AggregationPlanNode.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9wbGFuL0FnZ3JlZ2F0aW9uUGxhbk5vZGUuamF2YQ==) | `0.00% <0.00%> (-91.00%)` | :arrow_down: |
   | [...gregation/function/AggregationFunctionFactory.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9BZ2dyZWdhdGlvbkZ1bmN0aW9uRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (-83.34%)` | :arrow_down: |
   | [...tion/DistinctCountSmartHLLAggregationFunction.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9EaXN0aW5jdENvdW50U21hcnRITExBZ2dyZWdhdGlvbkZ1bmN0aW9uLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...che/pinot/segment/spi/AggregationFunctionType.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL0FnZ3JlZ2F0aW9uRnVuY3Rpb25UeXBlLmphdmE=) | `0.00% <0.00%> (-90.25%)` | :arrow_down: |
   | [...ain/java/org/apache/pinot/core/data/table/Key.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL3RhYmxlL0tleS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/spi/utils/BooleanUtils.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdXRpbHMvQm9vbGVhblV0aWxzLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/core/data/table/Record.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL3RhYmxlL1JlY29yZC5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [.../java/org/apache/pinot/core/util/GroupByUtils.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS91dGlsL0dyb3VwQnlVdGlscy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...ava/org/apache/pinot/spi/config/table/FSTType.java](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvY29uZmlnL3RhYmxlL0ZTVFR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | ... and [1299 more](https://codecov.io/gh/apache/pinot/pull/8189/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [b12b7bb...e7c5165](https://codecov.io/gh/apache/pinot/pull/8189?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang merged pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang merged pull request #8189:
URL: https://github.com/apache/pinot/pull/8189


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804660510



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;

Review comment:
       Does this have to be a set, I think a roaringbitmap might be better




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#issuecomment-1039660887


   @richardstartin I'd go with the set based solution for now to have the same behavior with `distinct_count` for low cardinality case. We can revisit both functions after collecting more info in the future. In the meanwhile, if user knows bitmap based solution performs better, they can use `DistinctCountBitmap` instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804659216



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java
##########
@@ -229,16 +235,11 @@ private static boolean isFitForMetadataBasedPlan(AggregationFunction[] aggregati
 
   /**
    * Returns {@code true} if the given aggregations can be solved with dictionary, {@code false} otherwise.
-   * <p>Aggregations supported: MIN, MAX, MIN_MAX_RANGE, DISTINCT_COUNT, SEGMENT_PARTITIONED_DISTINCT_COUNT
    */
   private static boolean isFitForDictionaryBasedPlan(AggregationFunction[] aggregationFunctions,
       IndexSegment indexSegment) {
     for (AggregationFunction aggregationFunction : aggregationFunctions) {
-      AggregationFunctionType functionType = aggregationFunction.getType();
-      if (functionType != AggregationFunctionType.MIN && functionType != AggregationFunctionType.MAX
-          && functionType != AggregationFunctionType.MINMAXRANGE
-          && functionType != AggregationFunctionType.DISTINCTCOUNT
-          && functionType != AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT) {
+      if (!DICTIONARY_BASED_FUNCTIONS.contains(aggregationFunction.getType())) {

Review comment:
       +1, I made the same change on a local branch 😁 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804722637



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;
+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;

Review comment:
       It looks like converting `double` to `float` would lead to a relative error of around 0.5% in the typical case, but it doesn't have an upper bound




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on a change in pull request #8189: Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #8189:
URL: https://github.com/apache/pinot/pull/8189#discussion_r804687447



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
##########
@@ -152,6 +123,50 @@ private double toDouble(Comparable value) {
     }
   }
 
+  private Set getDistinctValueSet(Dictionary dictionary) {
+    int dictionarySize = dictionary.length();
+    switch (dictionary.getValueType()) {
+      case INT:
+        IntOpenHashSet intSet = new IntOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          intSet.add(dictionary.getIntValue(dictId));
+        }
+        return intSet;
+      case LONG:
+        LongOpenHashSet longSet = new LongOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          longSet.add(dictionary.getLongValue(dictId));
+        }
+        return longSet;
+      case FLOAT:
+        FloatOpenHashSet floatSet = new FloatOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          floatSet.add(dictionary.getFloatValue(dictId));
+        }
+        return floatSet;
+      case DOUBLE:
+        DoubleOpenHashSet doubleSet = new DoubleOpenHashSet(dictionarySize);
+        for (int dictId = 0; dictId < dictionarySize; dictId++) {
+          doubleSet.add(dictionary.getDoubleValue(dictId));
+        }
+        return doubleSet;

Review comment:
       this appears to be beneficial:
   
   ```java
         RoaringBitmap bitmap = new RoaringBitmap();
         FloatOpenHashSet set = new FloatOpenHashSet();
         long bitmapBefore = GraphLayout.parseInstance(bitmap).totalSize();
         long setBefore = GraphLayout.parseInstance(set).totalSize();
         for (int i = 0; i < 1 << 20; i++) {
           float f = ThreadLocalRandom.current().nextFloat() * ThreadLocalRandom.current().nextLong();
           bitmap.add(Float.floatToIntBits(f));
           set.add(f);
         }
         System.err.println("bitmap: " + ((GraphLayout.parseInstance(bitmap).totalSize() - bitmapBefore) >>> 20) + "MB");
         System.err.println(GraphLayout.parseInstance(bitmap).toFootprint());
         System.err.println("set: " + ((GraphLayout.parseInstance(set).totalSize() - setBefore) >>> 20) + "MB");
         System.err.println(GraphLayout.parseInstance(set).toFootprint());
   ```
   
   ```
   bitmap: 2MB
   org.roaringbitmap.RoaringBitmap@36a6bea6d footprint:
        COUNT       AVG       SUM   DESCRIPTION
         3618       706   2555408   [C
            1     15008     15008   [Lorg.roaringbitmap.Container;
         3617        24     86808   org.roaringbitmap.ArrayContainer
            1        24        24   org.roaringbitmap.RoaringArray
            1        16        16   org.roaringbitmap.RoaringBitmap
         7238             2657264   (total)
   
   
   set: 7MB
   it.unimi.dsi.fastutil.floats.FloatOpenHashSet@a62c7cdd footprint:
        COUNT       AVG       SUM   DESCRIPTION
            1   8388632   8388632   [F
            1        48        48   it.unimi.dsi.fastutil.floats.FloatOpenHashSet
            2             8388680   (total)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org