You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/11 17:08:01 UTC
[GitHub] [druid] suneet-s commented on a change in pull request #9317: ANY Aggregator should not skip null values implementation

suneet-s commented on a change in pull request #9317: ANY Aggregator should not skip null values implementation
URL: https://github.com/apache/druid/pull/9317#discussion_r377773517
 
 

 ##########
 File path: docs/querying/aggregations.md
 ##########
 @@ -238,7 +238,7 @@ Note that queries with first/last aggregators on a segment created with rollup e
 
 (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
 
-If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any non-null value.
+If `druid.generic.useDefaultValueForNull=true` aggregation can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then aggregation will returns any value including null.
 
 Review comment:
   I think we should re-write this to explain why someone would use this aggregator similar to how it's explained in the snowflake docs - https://docs.snowflake.net/manuals/sql-reference/functions/any_value.html
   I'm not sure where the correct place in the docs is to explain this - since technically this is the spec for the native query and we have another page with a spec for sql.
   
   Here's my suggestion for the why:
   
   ```
   ANY aggregator can be used to simplify and optimize the performance of GROUP BY statements. A common problem for many queries is that the result of a query with a GROUP BY clause can only contain expressions used in the GROUP BY clause itself, or results of aggregate functions
   
   select customer.id , customer.name , sum(orders.value)
       from customer
       join orders on customer.id = orders.customer_id
       group by customer.id , customer.name;
   
   Since we know that customer.id can have only one name, this can be optimized as
   
   select customer.id , ANY(customer.name) , sum(orders.value)
       from customer
       join orders on customer.id = orders.customer_id
       group by customer.id , customer.name;
   ```
   
   I should also point out, with the current implementation of aggregators, there is no advantage to using an ANY aggregator vs a min aggregator, but maybe that will change in the future 🤷‍♂ 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org