You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/25 21:33:06 UTC

[GitHub] [arrow-datafusion] tustvold commented on pull request #6034: Implement Streaming Aggregation: Do not break pipeline in aggregation if group by columns are ordered

tustvold commented on PR #6034:
URL: https://github.com/apache/arrow-datafusion/pull/6034#issuecomment-1522449963

   > to prevent unnecessary empty vector creation
   
   FWIW an empty vector just contains a `NonNull::dangling`, it doesn't allocate anything, so wrapping it in an `Option` is redundant. That being said the additions to `GroupState` will make it non-trivially larger, which could conceivably cause performance regressions.
   
   I'm not very familiar with the group by code anymore, but the use of per-group allocations does immediately stand out to me as at risk of thrashing the memory allocator, and consequently making the code not just slow but also wildly unpredictable. I'm aware this isn't something introduced by this PR, but revisiting this design (see #4973) may make it easier to make changes in this area without introducing regressions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org