Posted to github@arrow.apache.org by "korowa (via GitHub)" <gi...@apache.org> on 2023/03/21 07:48:59 UTC

[GitHub] [arrow-datafusion] korowa commented on issue #1599: Memory Limited Joins (Externalized / Spill)

korowa commented on issue #1599:
URL: https://github.com/apache/arrow-datafusion/issues/1599#issuecomment-1477399197

   It looks like, now that we are able to fail a query when it breaches the memory limit, it's the right time to start working on spills.
   
   Taking into account what has been written above, I guess the next step could be to implement spilling for MergeJoin (MJ): if our final intention is to have runtime HJ -> MJ conversion, it would be nice to have some guarantee that MJ won't fail for the same reason. I believe MJ spilling logic could be pretty straightforward, without any pitfalls. The naive approach would be to spill buffered-side data to .ipc files batch by batch; a more complex, and probably more efficient, approach would be to spill a concatenation of all the batches that fit in memory.
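
   To make the naive batch-by-batch approach a bit more concrete, here is a rough, std-only Rust sketch. It is not DataFusion code: `Batch`, `SpillingBuffer`, `push` and `replay` are hypothetical names, a plain `Vec<i64>` stands in for a RecordBatch, and a raw little-endian file stands in for the .ipc format. Batches stay in memory until the byte budget is used up; from then on every batch goes to disk, so the original order is preserved on replay.

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::PathBuf;

/// Hypothetical stand-in for a RecordBatch: one column of i64 values.
type Batch = Vec<i64>;

/// Buffered side of a join: keeps batches in memory up to `memory_limit`
/// bytes, then spills every further batch to `spill_path`, one at a time.
struct SpillingBuffer {
    memory_limit: usize,
    memory_used: usize,
    in_memory: Vec<Batch>,
    spill_path: PathBuf,
    spill_writer: Option<BufWriter<File>>,
    spilled_batches: usize,
}

impl SpillingBuffer {
    fn new(memory_limit: usize, spill_path: PathBuf) -> Self {
        Self {
            memory_limit,
            memory_used: 0,
            in_memory: Vec::new(),
            spill_path,
            spill_writer: None,
            spilled_batches: 0,
        }
    }

    /// Buffer a batch in memory if it fits, otherwise append it to the spill file.
    fn push(&mut self, batch: Batch) -> io::Result<()> {
        let size = batch.len() * std::mem::size_of::<i64>();
        // Once spilling has started, later batches also go to disk to keep order.
        if self.spill_writer.is_none() && self.memory_used + size <= self.memory_limit {
            self.memory_used += size;
            self.in_memory.push(batch);
            return Ok(());
        }
        if self.spill_writer.is_none() {
            self.spill_writer = Some(BufWriter::new(File::create(&self.spill_path)?));
        }
        let writer = self.spill_writer.as_mut().unwrap();
        // Assumed on-disk layout: u64 row count, then the values, little-endian.
        writer.write_all(&(batch.len() as u64).to_le_bytes())?;
        for value in &batch {
            writer.write_all(&value.to_le_bytes())?;
        }
        self.spilled_batches += 1;
        Ok(())
    }

    /// Visit all batches in insertion order: in-memory ones first, then the
    /// spilled ones, reading a single batch from disk at a time.
    fn replay(mut self, mut visit: impl FnMut(&[i64])) -> io::Result<()> {
        for batch in &self.in_memory {
            visit(batch.as_slice());
        }
        if let Some(mut writer) = self.spill_writer.take() {
            writer.flush()?;
            drop(writer);
            let mut reader = BufReader::new(File::open(&self.spill_path)?);
            let mut word = [0u8; 8];
            for _ in 0..self.spilled_batches {
                reader.read_exact(&mut word)?;
                let rows = u64::from_le_bytes(word) as usize;
                let mut batch = Vec::with_capacity(rows);
                for _ in 0..rows {
                    reader.read_exact(&mut word)?;
                    batch.push(i64::from_le_bytes(word));
                }
                visit(batch.as_slice());
            }
            std::fs::remove_file(&self.spill_path)?;
        }
        Ok(())
    }
}
```

   The "more complex" variant would differ mainly in `push`: instead of writing each overflowing batch on its own, it would concatenate whatever is currently buffered and write that as one larger batch, which should mean fewer, bigger writes.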
   
   After that we could follow up with what is mentioned in the issue description: HJ -> MJ conversion (I believe #2628 is worth mentioning here, since it would allow more hash joins to be converted), and spilling mechanisms for other join implementations.
   
   If this plan is fine, I'd like to take a stab at MJ spilling.

