Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/22 16:44:01 UTC

[GitHub] [arrow] alamb commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

alamb commented on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-696842287


   @drusso I think you are correct that we would need a separate group-by operator for each `COUNT(DISTINCT ...)` and would then combine their results.
   
   So `SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1` might look like:
   
   ```
   HashAggregateExec: // this second phase then counts
     group_expr:
       Column(c1)
     aggr_expr:
       CountReduce(Column(c2))
     input:
       HashAggregateExec: // this first agg expr finds all distinct values of (c1,c2)
         group_expr:
           Column(c1), Column(c2)
         input:
           CsvExec:
   
   JOIN ON (c1):
   
   HashAggregateExec: // this second phase then counts
     group_expr:
       Column(c1)
     aggr_expr:
       CountReduce(Column(c3))
     input:
       HashAggregateExec: // this first agg expr finds all distinct values of (c1,c3)
         group_expr:
           Column(c1), Column(c3)
         input:
           CsvExec:
   ```
   
   Or something like that. I like your suggestion to get an implementation in (this one) and then iterate as needed.
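
For illustration only, the plan above can be mimicked in plain Rust with standard-library collections. This is a rough sketch, not DataFusion code: the `Row` type and the function names `distinct_pairs`, `count_per_group`, and `count_distinct_c2_c3` are all hypothetical, the columns are assumed to be non-null, and `CountReduce` is modeled as a simple per-group row count.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical stand-in for one row produced by `CsvExec`: (c1, c2, c3).
type Row = (String, i64, i64);

/// First phase: group by (c1, value) so that only distinct pairs survive.
/// `pick` selects which column (c2 or c3) is being counted.
fn distinct_pairs<'a>(rows: &'a [Row], pick: fn(&Row) -> i64) -> HashSet<(&'a str, i64)> {
    rows.iter().map(|r| (r.0.as_str(), pick(r))).collect()
}

/// Second phase: group by c1 alone and count the distinct pairs in each group.
fn count_per_group<'a>(pairs: &HashSet<(&'a str, i64)>) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for (c1, _) in pairs {
        *counts.entry(*c1).or_insert(0) += 1;
    }
    counts
}

/// Runs the two-phase aggregation once per counted column, then joins on c1.
fn count_distinct_c2_c3(rows: &[Row]) -> Vec<(String, usize, usize)> {
    let c2_counts = count_per_group(&distinct_pairs(rows, |r| r.1));
    let c3_counts = count_per_group(&distinct_pairs(rows, |r| r.2));

    // JOIN ON (c1): both sides come from the same input, so with non-null
    // columns every c1 key on the left also appears on the right.
    c2_counts
        .iter()
        .map(|(c1, n2)| {
            let n3 = c3_counts.get(c1).copied().unwrap_or(0);
            (c1.to_string(), *n2, n3)
        })
        .collect()
}
```

Given the rows `("a", 1, 10)`, `("a", 1, 20)`, `("a", 2, 20)`, `("a", 3, 20)`, `("b", 5, 7)`, and `("b", 5, 7)`, `count_distinct_c2_c3` returns one entry per `c1` value: `("a", 3, 2)` and `("b", 1, 1)`. The sketch scans the input once per counted column, which mirrors the two separate `HashAggregateExec` pipelines in the plan.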

