You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/04/30 12:55:19 UTC

[GitHub] [druid] clintropolis opened a new pull request #11186: adjust topn heap algorithm to only use known cardinality path when dictionary is unique

clintropolis opened a new pull request #11186:
URL: https://github.com/apache/druid/pull/11186

### Description
This PR adjusts the heap based topN algorithm to only use known the "known" cardinality path only when the dictionary contains unique values. The known cardinality path to aggregate values uses an array based approach, where an array of aggregator arrays the size of the value cardinality is created, and the dictionaryId is expected to index to an array position with the aggregators for that value, as an optimization to avoid a map lookup.

However, when a selector is aggregated which does not have unique dictionaryIds, but does know its cardinality, such as a selector from an `IndexedTable` from a join result which uses the row number as the dictionaryId instead, it means that each dictionaryId will be 'new', and thus have a null array entry and still incur the map interaction this path is trying to avoid.

Instead, these selectors will now just use the map directly by using the cardinality "unknown" path instead.

This PR has:
- [x] been self-reviewed.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] clintropolis commented on a change in pull request #11186: adjust topn heap algorithm to only use known cardinality path when dictionary is unique

Posted by GitBox <gi...@apache.org>.

clintropolis commented on a change in pull request #11186:
URL: https://github.com/apache/druid/pull/11186#discussion_r624643675



##########
File path: processing/src/main/java/org/apache/druid/query/topn/types/StringTopNColumnAggregatesProcessor.java
##########
@@ -108,7 +116,7 @@ public long scanAndAggregate(
       Aggregator[][] rowSelector
   )
   {
-    if (selector.getValueCardinality() != DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
+    if (capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {

Review comment:
       Hmm, that is a good point, I guess it is implicit of the current state of things that the only things we have right now that report having unique dictionary ids are things with known cardinality, but I suppose this check needs both things to be true. I'll modify this check both conditions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] clintropolis commented on a change in pull request #11186: adjust topn heap algorithm to only use known cardinality path when dictionary is unique

Posted by GitBox <gi...@apache.org>.

clintropolis commented on a change in pull request #11186:
URL: https://github.com/apache/druid/pull/11186#discussion_r624865281



##########
File path: processing/src/main/java/org/apache/druid/query/topn/types/StringTopNColumnAggregatesProcessor.java
##########
@@ -108,7 +116,7 @@ public long scanAndAggregate(
       Aggregator[][] rowSelector
   )
   {
-    if (selector.getValueCardinality() != DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
+    if (capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {

Review comment:
       I've updated the check, and added a comment to try to explain what is going on with the aggregation algorithm selection and why




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei merged pull request #11186: adjust topn heap algorithm to only use known cardinality path when dictionary is unique

Posted by GitBox <gi...@apache.org>.

jon-wei merged pull request #11186:
URL: https://github.com/apache/druid/pull/11186


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11186: adjust topn heap algorithm to only use known cardinality path when dictionary is unique

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 commented on a change in pull request #11186:
URL: https://github.com/apache/druid/pull/11186#discussion_r623898757



##########
File path: processing/src/main/java/org/apache/druid/query/topn/types/StringTopNColumnAggregatesProcessor.java
##########
@@ -108,7 +116,7 @@ public long scanAndAggregate(
       Aggregator[][] rowSelector
   )
   {
-    if (selector.getValueCardinality() != DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
+    if (capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {

Review comment:
       so it is not possible to have unique dictionary ids but unknown cardinality? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org