Posted to github@arrow.apache.org by "2010YOUY01 (via GitHub)" <gi...@apache.org> on 2023/09/18 18:30:38 UTC

[GitHub] [arrow-datafusion] 2010YOUY01 commented on pull request #7337: feat: Implement quantile_cont()/quantile_disc() aggregate functions

2010YOUY01 commented on PR #7337:
URL: https://github.com/apache/arrow-datafusion/pull/7337#issuecomment-1724157350

   https://github.com/apache/arrow-datafusion/pull/7376 made several smart optimizations to `median()`,
   for example an O(n) quickselect in the final evaluate step of the aggregation.
   
   For `select median(l_partkey) from lineitem` using sf10 parquet TPCH data:
   Before -- ~20s
   After -- ~4s
   With multi-core sorting -- estimated ~2s
   
   With that change, the multi-core sorting approach now seems unnecessary. However, the above query spends only ~1% of its time in quickselect; most of the time goes to data type conversion and copying.
   I'll experiment to see whether there is any way to make `median()` faster before finishing this PR.
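
For readers unfamiliar with the quickselect idea mentioned above, here is a minimal sketch of a selection-based median in Rust. It is not the code from PR #7376: the function name `median_i64` is hypothetical, and the sketch assumes the values have already been collected into a single mutable `i64` slice. It relies on the standard library's `select_nth_unstable`, which places the element at a given index in its sorted position without sorting the rest of the slice.

```rust
/// Hypothetical helper (not DataFusion's implementation): computes the
/// median of `values` by selection instead of a full sort.
/// The slice is reordered in place. Returns `None` for empty input.
fn median_i64(values: &mut [i64]) -> Option<f64> {
    let len = values.len();
    if len == 0 {
        return None;
    }
    let mid = len / 2;

    // After this call, `*upper` is the element that would sit at index `mid`
    // in sorted order, and every element of `lower` is <= `*upper`.
    let (lower, upper, _) = values.select_nth_unstable(mid);
    let upper = *upper;

    if len % 2 == 1 {
        // Odd length: the middle element is the median.
        Some(upper as f64)
    } else {
        // Even length: average the two middle elements. The other one is
        // the largest value in the lower partition (non-empty since len >= 2).
        let lower_max = *lower.iter().max()?;
        // Convert before adding so the sum cannot overflow i64.
        Some((lower_max as f64 + upper as f64) / 2.0)
    }
}
```

Calling `median_i64(&mut [4, 1, 3, 2])` averages the two middle values. Since selection is already cheap relative to the rest of the query, a sketch like this only illustrates the final evaluate step, not the conversion and copying cost noted above.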

