Posted to dev@mxnet.apache.org by Rohit Kumar Srivastava <no...@github.com.INVALID> on 2020/11/24 22:24:10 UTC

[apache/incubator-mxnet] [RFC] Improve CPU memory pool management for MXNet process (#19585)

## Problem statement
MXNet's memory pool keeps memory allocated to the MXNet process so that NDArrays/tensors can be served quickly from the pool. Over time the pool can grow very large, and memory may not be returned to the pool immediately once an NDArray goes out of scope. While running the large-tensor nightly tests (all at once, sequentially), I saw certain tests hit OOM even on a 720 GB RAM machine (p2.16xl), although each test individually needed less than 50 GB. Adding `LOG(INFO)` statements to print how many bytes MXNet was requesting showed it asking for roughly 7500-8500 GB in total.
This suggests that either memory is not being released back to the pool after tensors go out of scope, or there is an internal fragmentation issue in the pool itself. These observations come from the test runs above and from past experience reading `pooled_storage_manager`. I will dive deeper into it and follow up with a concrete suggestion.
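To illustrate how a pool can grow far beyond live memory, here is a small self-contained sketch (not MXNet code) of a storage manager that keys cached buffers by exact request size, similar in spirit to a naive pooled strategy. Freed buffers go into per-size free lists, so a request of a slightly different size never reuses them and the pool footprint grows even though no memory is actually live:

```python
from collections import defaultdict

class NaivePool:
    """Toy pooled storage manager keyed by exact byte size (illustrative only)."""

    def __init__(self):
        self.free = defaultdict(list)   # size -> cached buffers of exactly that size
        self.pool_bytes = 0             # total bytes ever taken from the OS

    def alloc(self, size):
        if self.free[size]:
            return self.free[size].pop()
        self.pool_bytes += size         # cache miss: allocate fresh from the OS
        return bytearray(size)

    def release(self, buf):
        # Memory returns to the pool's free list, not to the OS.
        self.free[len(buf)].append(buf)

pool = NaivePool()
# 1000 short-lived buffers, each with a unique size: nothing is ever reused.
for i in range(1000):
    buf = pool.alloc(1024 + i)
    pool.release(buf)

print(pool.pool_bytes)  # ~1.5 MB retained while zero bytes are live
```

Real workloads with many distinct tensor shapes can trigger the same pattern at a much larger scale, which would explain cumulative requests far exceeding physical RAM.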

## Proposed solutions
1. Extend `MXNET_GPU_MEM_POOL_TYPE=Unpooled` to also apply to CPU memory, i.e. use similar (un)pooling strategies for CPU allocations
2. Fix the fragmentation issue within the pool, if one exists
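For solution 1, usage could mirror the existing GPU setting. `MXNET_GPU_MEM_POOL_TYPE=Unpooled` is a real MXNet environment variable that bypasses the GPU pool; the CPU-side variable name below (`MXNET_CPU_MEM_POOL_TYPE`) is hypothetical, shown only to sketch the proposed interface:

```shell
# Existing: GPU allocations bypass the pool and go straight to cudaMalloc/cudaFree.
export MXNET_GPU_MEM_POOL_TYPE=Unpooled
# Hypothetical CPU counterpart proposed here: allocations go straight to malloc/free.
export MXNET_CPU_MEM_POOL_TYPE=Unpooled
# ...then run the large-tensor nightly tests as usual.
```

An unpooled mode trades some allocation speed for a memory footprint that tracks live tensors, which would let the nightly tests confirm whether the OOMs are pool-induced.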

Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/19585