Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/22 19:52:13 UTC

[GitHub] [arrow-datafusion] Dandandan commented on issue #27: Implement hash-partitioned hash aggregate

Dandandan commented on issue #27:
URL: https://github.com/apache/arrow-datafusion/issues/27#issuecomment-825140989


   > > We probably need some fast way to have a rough estimate on the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).
   > 
   > Does your `TableProvider` column stats work provide any useful base for this in situations where we're running aggregations on original table columns (as opposed to computed exprs) or is that too coarse?
   
   I think yes: for cases where the table is queried directly and distinct-value statistics are available, those could be used as a heuristic for `group by` expressions based on cardinality (say, if you group by columns a and b, distinct(a) * distinct(b) would be an OK heuristic; see the sketch after this paragraph).
   We also need the same distinct-values-per-column support to generalize the join order optimization rule to more complex joins and to expressions beyond those directly on tables.
   Support for collecting those statistics (i.e. `ANALYZE TABLE`) would need to be added too.
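   
   As a rough illustration of that heuristic (just a sketch, not the DataFusion API; the function name and inputs are hypothetical), per-column distinct counts from table statistics could be combined under an independence assumption and capped by the row count:
   
   ```rust
   /// Hypothetical helper: combine per-column distinct-value statistics into a
   /// rough group-by cardinality estimate. `column_distinct_counts` would come
   /// from table statistics (e.g. a future `ANALYZE TABLE`); `num_rows` caps the
   /// estimate, since the number of groups can never exceed the input row count.
   fn estimate_group_by_cardinality(column_distinct_counts: &[u64], num_rows: u64) -> u64 {
       // Independence assumption: distinct(a, b) ~= distinct(a) * distinct(b),
       // saturating to avoid overflow on wide group-by keys.
       let product = column_distinct_counts
           .iter()
           .fold(1u64, |acc, &d| acc.saturating_mul(d.max(1)));
       product.min(num_rows)
   }
   
   fn main() {
       // e.g. GROUP BY a, b with distinct(a) = 100, distinct(b) = 50
       // over 1_000_000 rows -> estimate of 5_000 groups.
       let estimate = estimate_group_by_cardinality(&[100, 50], 1_000_000);
       println!("estimated groups: {estimate}");
   }
   ```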
   
   To support the more general case, we probably need a way to estimate the cardinality of the intermediate results, based on sampling one or a couple of batches evaluated with the particular `group by` expression.
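   
   A minimal sketch of what that sampling-based estimate could look like, assuming we can evaluate the `group by` expressions on the first batch(es) and extract the keys (the helper and the simple linear extrapolation are assumptions, not existing DataFusion code; a real implementation would likely want a better estimator such as HyperLogLog):
   
   ```rust
   use std::collections::HashSet;
   use std::hash::Hash;
   
   /// Hypothetical helper: count the distinct group-by keys observed in a
   /// sampled batch and extrapolate the distinct ratio to the full input,
   /// never exceeding the total number of rows.
   fn estimate_distinct_from_sample<K: Eq + Hash>(sample_keys: &[K], total_rows: usize) -> usize {
       if sample_keys.is_empty() {
           return 0;
       }
       let distinct_in_sample = sample_keys.iter().collect::<HashSet<_>>().len();
       let ratio = distinct_in_sample as f64 / sample_keys.len() as f64;
       ((ratio * total_rows as f64).round() as usize).min(total_rows)
   }
   
   fn main() {
       // First batch: 8 rows with 3 distinct (a, b) keys, 10_000 rows total.
       let sample = vec![
           (1, "x"), (1, "x"), (2, "y"), (2, "y"),
           (3, "z"), (1, "x"), (3, "z"), (2, "y"),
       ];
       println!("estimated groups: {}", estimate_distinct_from_sample(&sample, 10_000));
   }
   ```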
   
   I added this requirement in this design doc:
   https://docs.google.com/document/d/17DCBe_HygkjsoMzC4Znw-t8i1okLGlBkf0kp8DfLBhk/edit?usp=drivesdk
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org