Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/26 12:50:13 UTC

[GitHub] [arrow-datafusion] yjshen commented on issue #3941: Generate runtime errors if the memory budget is exceeded [EPIC]

yjshen commented on issue #3941:
URL: https://github.com/apache/arrow-datafusion/issues/3941#issuecomment-1291985225

   Hi, sorry to join this party late.
   
   I don't quite get why async is a problem for the memory manager when implementing aggregation.
   
   We could do memory-limited aggregation as follows:
   
   1. try to allocate a fixed-size buffer for incoming groups: a record batch, or a row batch if row-based aggregation is possible, with 8096 rows for example.
   2. aggregate incoming records, updating existing buffer slots or adding new rows if there's space left in the last pre-allocated buffer.
   3. if the last allocated aggregation buffer is full, try_grow the aggregate's memory reservation by allocating another fixed-size buffer:
   - if more memory quota is granted, go back to step 2 and continue aggregating.
   - if the memory manager refuses one more aggregation buffer, spill (see the spill procedure below).
   4. once the input stream is exhausted, either:
   - if the aggregate is final: get a sorted iterator over the in-memory aggregation buffers, do a multi-way merge with the spills, produce the final results, and free all memory at the end.
   - if the aggregate is partial: output the in-memory buffers, and free all the memory.
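   The loop in steps 1-4 can be sketched in Rust with a toy memory pool. Everything here is illustrative, not DataFusion's actual API: `MemoryPool`, its `try_grow`/`free` methods, and `BATCH_ROWS` are hypothetical names, the accumulator is a plain count per group, and for brevity the spilled runs stay in memory and the final merge uses a hash map instead of a streaming multi-way merge of the sorted runs.

```rust
use std::collections::HashMap;

// Tiny buffer size so the demo spills early; real batches would be ~8096 rows.
const BATCH_ROWS: usize = 4;

/// Toy memory pool: grants fixed-size allocations until the budget is hit.
/// (Hypothetical stand-in for a real memory manager's try_grow.)
struct MemoryPool { budget: usize, used: usize }

impl MemoryPool {
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes <= self.budget { self.used += bytes; true } else { false }
    }
    fn free(&mut self, bytes: usize) { self.used -= bytes; }
}

/// Memory-limited count-per-group aggregation, assuming the budget admits
/// at least one buffer. Returns sorted (group key, count) pairs.
fn aggregate(input: &[u64], pool: &mut MemoryPool) -> Vec<(u64, u64)> {
    let buf_bytes = BATCH_ROWS * 16; // pretend each group slot costs 16 bytes
    let mut groups: HashMap<u64, u64> = HashMap::new();
    let mut capacity = 0usize; // group slots we currently hold memory for
    let mut spills: Vec<Vec<(u64, u64)>> = Vec::new();

    for &key in input {
        if !groups.contains_key(&key) && groups.len() == capacity {
            // step 3: last buffer is full, ask for one more fixed-size buffer
            if pool.try_grow(buf_bytes) {
                capacity += BATCH_ROWS;
            } else {
                // spill: sort current buffers by group key, stash as a run,
                // free all the memory, then go back to step 1
                let mut run: Vec<_> = groups.drain().collect();
                run.sort();
                spills.push(run);
                pool.free(capacity / BATCH_ROWS * buf_bytes);
                capacity = 0;
                if pool.try_grow(buf_bytes) { capacity += BATCH_ROWS; }
            }
        }
        *groups.entry(key).or_insert(0) += 1; // step 2: update or add a slot
    }

    // step 4 (final aggregate): merge the in-memory run with spilled runs
    let mut run: Vec<_> = groups.drain().collect();
    run.sort();
    spills.push(run);
    let mut merged: HashMap<u64, u64> = HashMap::new();
    for run in spills {
        for (k, v) in run { *merged.entry(k).or_insert(0) += v; }
    }
    let mut out: Vec<_> = merged.into_iter().collect();
    out.sort();
    out
}
```

   With a budget of one buffer the loop spills once partway through, yet the merged result matches what an unconstrained run would produce.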
   
   The spill procedure would be:
   
   - if the aggregate is partial, produce partial output from all the aggregation buffers we have, free all the memory, and go back to step 1 to handle the following incoming tuples.
   - if the aggregate is final, sort all the current buffers by group key, spill them to disk, add the file to the spilled-files list, and go back to step 1 to handle the following incoming tuples.
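   The partial-vs-final branch of the spill procedure can be sketched as follows. The types are hypothetical (`AggMode`, `GroupEntry`, and the in-memory `spilled_runs` standing in for files on disk); both branches empty the buffers, which models "free all the memory and go back to step 1".

```rust
// Hypothetical mode flag: whether this operator produces partial or final results.
enum AggMode { Partial, Final }

/// One in-memory group entry: (group key, accumulator state).
type GroupEntry = (u64, u64);

/// Spill per the two cases above: a partial aggregate emits its buffers as
/// early output downstream; a final aggregate sorts them by group key and
/// keeps them as a run for the later multi-way merge.
fn spill(
    mode: &AggMode,
    buffers: &mut Vec<GroupEntry>,
    output: &mut Vec<GroupEntry>,
    spilled_runs: &mut Vec<Vec<GroupEntry>>,
) {
    match mode {
        // partial: produce partial output based on the buffers we have
        AggMode::Partial => output.append(buffers),
        // final: sort by group key and stash the run (a file on disk in practice)
        AggMode::Final => {
            let mut run = std::mem::take(buffers);
            run.sort_by_key(|e| e.0);
            spilled_runs.push(run);
        }
    }
}
```

   Emitting early partial output instead of spilling is what lets the partial stage stay within its budget without ever touching disk, since the final stage will re-aggregate those rows anyway.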
   
   Also, one could refer to [Apache Spark's hash aggregate impl](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L32-L49) if interested.

