Posted to github@arrow.apache.org by "milenkovicm (via GitHub)" <gi...@apache.org> on 2023/04/26 09:39:07 UTC

[GitHub] [arrow-datafusion] milenkovicm commented on issue #1570: Memory Limited GroupBy (Externalized / Spill)

milenkovicm commented on issue #1570:
URL: https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1523105149

   I'm not sure I see many benefits of having it serializable; I would agree with @crepererum here.
   This discussion would make more sense if we knew more about your implementation.
   
   IMHO, aggregation should start with a hash map and assume there will be no spill; if we're wrong, we pay the penalty of being wrong, since we will have to sort the map before spilling.
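
   A minimal sketch of that optimistic path (names are hypothetical, not DataFusion's actual API): aggregate into a plain `HashMap`, and only on the spill path pay the cost of sorting the entries by key so the spilled run can later be merged:

```rust
use std::collections::HashMap;

// Hypothetical sketch: groups accumulate in a hash map on the happy path.
// Only when a spill is triggered do we sort the entries by key, so the
// run written to disk is ordered and mergeable later.
fn sorted_run_for_spill(map: &HashMap<String, u64>) -> Vec<(String, u64)> {
    let mut run: Vec<(String, u64)> =
        map.iter().map(|(k, v)| (k.clone(), *v)).collect();
    // This sort is the "penalty of being wrong" — paid only on spill.
    run.sort_by(|a, b| a.0.cmp(&b.0));
    run
}

fn main() {
    let mut agg: HashMap<String, u64> = HashMap::new();
    for key in ["b", "a", "b", "c"] {
        *agg.entry(key.to_string()).or_insert(0) += 1;
    }
    let run = sorted_run_for_spill(&agg);
    assert_eq!(
        run,
        vec![("a".to_string(), 1), ("b".to_string(), 2), ("c".to_string(), 1)]
    );
}
```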
   
   Once it has spilled to disk, I'd argue it makes more sense to switch from a hash map to a b-tree, since we will need to merge the in-memory state with the spill. Inserts are slower, but in my experience this is a bit faster than sorting the hash map again.
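
   The idea above could look roughly like this (a sketch, not the actual implementation): post-spill groups go into a `BTreeMap`, whose iteration order is already sorted, so producing final results is a streaming merge of two sorted runs, summing values when a key appears on both sides:

```rust
use std::collections::BTreeMap;

// Hypothetical sketch: merge two key-sorted runs (in-memory state and a
// spilled run), combining values for keys present in both.
fn merge_sorted_runs(a: &[(String, u64)], b: &[(String, u64)]) -> Vec<(String, u64)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() || j < b.len() {
        if j == b.len() || (i < a.len() && a[i].0 < b[j].0) {
            out.push(a[i].clone());
            i += 1;
        } else if i == a.len() || b[j].0 < a[i].0 {
            out.push(b[j].clone());
            j += 1;
        } else {
            // Same group key on both sides: combine the partial aggregates.
            out.push((a[i].0.clone(), a[i].1 + b[j].1));
            i += 1;
            j += 1;
        }
    }
    out
}

fn main() {
    // After the first spill, new groups land in a BTreeMap ...
    let mut post_spill: BTreeMap<String, u64> = BTreeMap::new();
    for key in ["c", "a", "c"] {
        *post_spill.entry(key.to_string()).or_insert(0) += 1;
    }
    // ... so the in-memory run is sorted with no extra sort step.
    let in_memory: Vec<(String, u64)> = post_spill.into_iter().collect();
    let spilled = vec![("a".to_string(), 2), ("b".to_string(), 1)];
    let merged = merge_sorted_runs(&in_memory, &spilled);
    assert_eq!(
        merged,
        vec![("a".to_string(), 3), ("b".to_string(), 1), ("c".to_string(), 2)]
    );
}
```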
   
   Spilling can be implemented using a two-column Parquet file (key: blob, value: blob).
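
   To illustrate the layout only (a real implementation would write two Parquet `BYTE_ARRAY` columns via the `parquet` crate; this dependency-free stand-in uses length-prefixed records to show the same (key: blob, value: blob) shape):

```rust
use std::io::{Cursor, Read, Write};

// Stand-in sketch for the proposed (key: blob, value: blob) spill layout.
// Each record is: key length (u32 LE), key bytes, value length (u32 LE),
// value bytes. Both sides are opaque blobs, as in the suggested schema.
fn write_record<W: Write>(w: &mut W, key: &[u8], value: &[u8]) -> std::io::Result<()> {
    w.write_all(&(key.len() as u32).to_le_bytes())?;
    w.write_all(key)?;
    w.write_all(&(value.len() as u32).to_le_bytes())?;
    w.write_all(value)?;
    Ok(())
}

fn read_record<R: Read>(r: &mut R) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut key = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut key)?;
    r.read_exact(&mut len)?;
    let mut value = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut value)?;
    Ok((key, value))
}

fn main() -> std::io::Result<()> {
    // Round-trip one record: an opaque group key and an encoded aggregate.
    let mut buf = Vec::new();
    write_record(&mut buf, b"group-a", &1u64.to_le_bytes())?;
    let mut cursor = Cursor::new(buf);
    let (key, value) = read_record(&mut cursor)?;
    assert_eq!(key, b"group-a");
    assert_eq!(u64::from_le_bytes(value.try_into().unwrap()), 1);
    Ok(())
}
```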
   
   An implementation like this works quite well in my experience, especially since in most cases we won't trigger a spill at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org