You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/22 12:30:57 UTC

[GitHub] [arrow] drusso commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

drusso commented on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-696689994


   Thanks for the review/feedback all!
   
   @jorgecarleitao: 
   
   > it may be worth take a look at #8172 , where we are trying to improve how to declare and run aggregate expressions. Many parts of this PR are in conflict with that one (at API level, all the implementation here is valid).
   
   I will get the changes here updated and integrated with #8172. It looks like they've already landed 👍 
   
   > Out of curiosity: is the DISTINCT something that is applicable to an aggregate expression in general, or is something that is only applicable to a subset of aggregations (such as count, sum)?
   
   To my knowledge `DISTINCT` is applicable to all aggregations. Though there are some cases where `DISTINCT` won't have any effect on the result, for example the results of `MIN(col)` and `MIN(DISTINCT col)` are always the same. This is something the logical (or physical) planner can optimize for – there's no need to do the work to compute the distinct set in those cases. 
   
   (Also note that `DISTINCT` is applicable outside the context of aggregations, for example `SELECT DISTINCT c1, c2 FROM t1`.)
   
   @alamb:
   
   > One potential downside of the approach in this PR is that it will likely struggle when there are a large number of distinct values per group (as the FnvHashSet will be huge) as well as needing special support for each datatype.
   
   Agreed. I'm happy to continue iterating and improving on the work here. 
   
   > Another approach we could consider is to use a HashAggregateExpr operation as a pre-filter to remove duplicates.
   
   I had this thought as well, however – and correct me if I'm wrong – it doesn't generalize to a scenario like:
   
   ```
   SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1
   ```
   
   For the `COUNT(DISTINCT c2)` expression we would want distinct pairs of `(c1, c2)`; for the `COUNT(DISTINCT c3)` expression we would want distinct pairs of `(c1, c3)`. But we can't do both simultaneously. The existing implementation handles that scenario. Perhaps there's some ideas here we can use for further optimizations, though. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org