Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/18 21:06:33 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #587: Optionally Limit memory used by DataFusion plan

alamb opened a new issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   If DataFusion processes individual partitions that are larger than the available system memory, it currently keeps allocating memory from the system until it is killed by the OS or container system.
   
   Also, when running multiple DataFusion plans in the same process, each will consume memory without limit. It may be desirable to reserve / cap the memory used by any individual plan to ensure the plans don't together exceed the system memory budget.
   
   Thus, it would be nice if we could give DataFusion's plans a memory budget that they then stay under.
   
   
   **Describe the solution you'd like**
   1. Add an option to ExecutionConfig that has a “total plan memory budget”
   2. Add logic to each node that requires a memory buffer to ensure it stays under the limit.
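
A minimal sketch of what such a shared budget could look like, assuming a single counter that every buffering operator reserves against before it allocates. The names `MemoryBudget`, `BudgetExceeded`, `try_reserve`, and `release` are hypothetical and are not part of DataFusion's API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical per-plan memory budget shared by all operators of one plan.
/// This is an illustration only, not a DataFusion type.
pub struct MemoryBudget {
    limit: usize,
    used: AtomicUsize,
}

/// Error returned when a reservation would push the plan over its budget.
#[derive(Debug, PartialEq)]
pub struct BudgetExceeded {
    pub requested: usize,
    pub available: usize,
}

impl MemoryBudget {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    /// Reserve `bytes`, or fail and leave the counter unchanged.
    pub fn try_reserve(&self, bytes: usize) -> Result<(), BudgetExceeded> {
        let mut current = self.used.load(Ordering::Relaxed);
        loop {
            let available = self.limit.saturating_sub(current);
            if bytes > available {
                return Err(BudgetExceeded { requested: bytes, available });
            }
            // Retry if another operator changed the counter in the meantime.
            match self.used.compare_exchange_weak(
                current,
                current + bytes,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Ok(()),
                Err(actual) => current = actual,
            }
        }
    }

    /// Return previously reserved bytes. Callers must release only what
    /// they successfully reserved.
    pub fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::AcqRel);
    }

    /// Bytes currently reserved across all operators.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::Acquire)
    }
}
```

An operator such as Sort would call `try_reserve` before growing its buffer and turn an `Err` into a query error, and would call `release` when the buffer is dropped.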
   
   The operators that can use large amounts of memory today are:
   1. Sort
   2. Join
   3. GroupByHash
   
   There are many potential ways to ensure the limit is respected:
   1. (Simplest) Error if the budget is exceeded
   2. (More complex) Employ algorithms that can use secondary storage (e.g. temp files), such as a sort that spills multiple rounds of partially sorted results and then runs a final merge phase to produce the global ordering for the partition
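
The spilling sort in option 2 can be sketched as follows. This is an illustration under simplifying assumptions, not DataFusion's implementation: values are plain `i64`, each spilled run is a `Vec` standing in for a temp file, and the function names `spill_sort` and `merge_runs` are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort `input` while buffering at most `max_buffered` values at a time.
/// Each full buffer is sorted and "spilled" as one run. A run here is a
/// `Vec` standing in for a temp file on secondary storage.
pub fn spill_sort(input: impl IntoIterator<Item = i64>, max_buffered: usize) -> Vec<i64> {
    assert!(max_buffered > 0, "buffer must hold at least one value");
    let mut runs: Vec<Vec<i64>> = Vec::new();
    let mut buffer: Vec<i64> = Vec::with_capacity(max_buffered);
    for value in input {
        buffer.push(value);
        if buffer.len() == max_buffered {
            // Buffer is full: sort it and spill it as one run.
            buffer.sort_unstable();
            runs.push(std::mem::replace(&mut buffer, Vec::with_capacity(max_buffered)));
        }
    }
    if !buffer.is_empty() {
        buffer.sort_unstable();
        runs.push(buffer);
    }
    merge_runs(runs)
}

/// Final merge phase: k-way merge of sorted runs using a min-heap that
/// holds one candidate value per run.
fn merge_runs(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = runs.iter().map(|run| run.len()).sum();
    let mut iters: Vec<_> = runs.into_iter().map(|run| run.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (run_idx, iter) in iters.iter_mut().enumerate() {
        if let Some(value) = iter.next() {
            heap.push(Reverse((value, run_idx)));
        }
    }
    // A real operator would stream merged values downstream instead of
    // collecting them into one in-memory vector.
    let mut out = Vec::with_capacity(total);
    while let Some(Reverse((value, run_idx))) = heap.pop() {
        out.push(value);
        if let Some(next) = iters[run_idx].next() {
            heap.push(Reverse((next, run_idx)));
        }
    }
    out
}
```

During the merge only one value per run needs to be resident, which is what keeps the memory used by the final phase proportional to the number of runs rather than to the input size.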
   
   **Describe alternatives you've considered**
   There are some interesting tradeoffs between “up front allocation” (dividing memory up across all operators that would need it) and a more dynamic approach.
   
   This will likely require major effort across many different issues -- I suggest we use this issue to implement a simple "error if over limit" strategy and then work on more sophisticated strategies subsequently.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb closed issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587


   





[GitHub] [arrow-datafusion] edrevo commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
edrevo commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-864152315


   I would add Repartition as another operation that might use a bunch of memory.





[GitHub] [arrow-datafusion] andygrove commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-894802884


   We should also discuss creating a scheduler in DataFusion (see https://github.com/apache/arrow-datafusion/issues/64) since it is related to this work. Rather than try and run all the things at once, it would be better to schedule work based on the available resources (cores / memory). We would still need the ability to track/limit memory use within operators but the scheduler could be aware of this and only allocate tasks if there is memory budget available.
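
As a rough illustration of the idea of only allocating tasks when memory budget is available, here is a minimal sketch. It assumes each task carries an up-front memory estimate and that tasks are admitted in FIFO order; `Task`, `Scheduler`, `submit`, `admit`, and `finish` are hypothetical names and not an existing DataFusion API.

```rust
use std::collections::VecDeque;

/// Hypothetical unit of work with an estimated memory requirement.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub name: &'static str,
    pub est_bytes: usize,
}

/// Hypothetical scheduler that starts a task only when its estimate fits
/// in the memory that is still free.
pub struct Scheduler {
    free_bytes: usize,
    pending: VecDeque<Task>,
}

impl Scheduler {
    pub fn new(total_bytes: usize) -> Self {
        Self { free_bytes: total_bytes, pending: VecDeque::new() }
    }

    /// Queue a task; it does not run until `admit` finds room for it.
    pub fn submit(&mut self, task: Task) {
        self.pending.push_back(task);
    }

    /// Start queued tasks in FIFO order for as long as the next one fits.
    pub fn admit(&mut self) -> Vec<Task> {
        let mut started = Vec::new();
        while let Some(next) = self.pending.front() {
            if next.est_bytes > self.free_bytes {
                break; // head of the queue does not fit yet
            }
            let task = self.pending.pop_front().expect("front() was Some");
            self.free_bytes -= task.est_bytes;
            started.push(task);
        }
        started
    }

    /// Give a finished task's memory back to the pool.
    pub fn finish(&mut self, task: &Task) {
        self.free_bytes += task.est_bytes;
    }
}
```

Strict FIFO admission is only one possible policy: a task whose estimate exceeds the total budget would block the queue forever, so a real scheduler would need to reject such tasks or let them spill.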





[GitHub] [arrow-datafusion] alamb commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-899594028


   I filed https://github.com/apache/arrow-datafusion/issues/898 for tracking memory used by a plan





[GitHub] [arrow-datafusion] liukun4515 commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-1013697747


   @alamb Maybe we should take the `join` operation into this track.





[GitHub] [arrow-datafusion] alamb commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-899600992


   https://github.com/apache/arrow-datafusion/issues/899 for tracking memory used by individual operators





[GitHub] [arrow-datafusion] yjshen commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-964742820


   I created a proposal trying to fix this. Please refer to https://docs.google.com/document/d/1BT5HH-2sKq-Jxo51PNE6l9NNd_F-FyyYcyC3SKTnkIA/edit# for the whole proposal.





[GitHub] [arrow-datafusion] alamb commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-1013670632


   I have started adding a "Progress Tracking" list to the description of this ticket. Please update it with additional items as you discover them.





[GitHub] [arrow-datafusion] alamb commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-1013733620


   > @alamb Maybe we should take the join operation into this track.
   
   It is a good idea @liukun4515  -- I ran out of ambition while typing up Sort and Grouping. I'll try and write up some thoughts on joins later





[GitHub] [arrow-datafusion] liukun4515 commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-1013784903


   > > @alamb Maybe we should take the join operation into this track.
   > 
   > It is a good idea @liukun4515 -- I ran out of ambition while typing up Sort and Grouping. I'll try and write up some thoughts on joins later
   
   I'm not familiar with external operations; I will go through other databases to learn about them.





[GitHub] [arrow-datafusion] alamb commented on issue #587: Optionally Limit memory used by DataFusion plan

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587#issuecomment-1014631605


   I wrote up some thoughts about externalized joins on https://github.com/apache/arrow-datafusion/issues/1599

