Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/05/16 20:23:08 UTC

[GitHub] [incubator-mxnet] dmidge8 opened a new issue #14975: Mxnet doesn't reclaim the memory once there is a cudaMalloc fail

dmidge8 opened a new issue #14975: Mxnet doesn't reclaim the memory once there is a cudaMalloc fail
URL: https://github.com/apache/incubator-mxnet/issues/14975
 
 
   ## Description
   When a training run causes a GPU memory error (e.g. because the batch size is too high), no subsequent training run in the same process can succeed, even with a normal amount of memory (e.g. a corrected batch size).
   
   ## Details
   During training, memory is allocated and used as needed. When it is released, it stays allocated on the device but is moved to a pool of "available memory" for the next training run. If no pooled memory is left but more needs to be allocated, MXNet looks for memory it can free and reuse. However, if the previous training run crashed because it requested too much memory, the memory it had already allocated is never marked as "available", so it can be neither freed nor reused by the following run.
   The only workaround is to kill the process and restart the training runs.
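   
   To illustrate the failure mode, here is a minimal C++ sketch of a pooled allocator. This is not MXNet's actual storage manager, and the class and member names are hypothetical; it only shows why blocks that are never passed back through an explicit release call stay stranded as "in use" after an exception:
   
   ```cpp
   // Simplified pooled device allocator (illustrative only, not MXNet code).
   #include <cstddef>
   #include <cstdlib>
   #include <stdexcept>
   #include <unordered_map>
   #include <vector>
   
   class PooledAllocator {
    public:
     void* Alloc(std::size_t bytes) {
       // Reuse a pooled block of the same size if one is available.
       auto it = pool_.find(bytes);
       if (it != pool_.end() && !it->second.empty()) {
         void* p = it->second.back();
         it->second.pop_back();
         return p;
       }
       // Otherwise ask the device for fresh memory (std::malloc stands in
       // for cudaMalloc in this sketch).
       void* p = std::malloc(bytes);
       if (p == nullptr) {
         throw std::runtime_error("out of device memory");
       }
       return p;
     }
   
     // Blocks come back to the pool only through this explicit call. If the
     // training loop throws before reaching its release calls, the blocks it
     // holds are never pooled again and remain "in use" from the allocator's
     // point of view.
     void Release(void* p, std::size_t bytes) { pool_[bytes].push_back(p); }
   
    private:
     std::unordered_map<std::size_t, std::vector<void*>> pool_;
   };
   ```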
   
   ## Environment info (Required)
   I am using the C++ version, compiled from source (1.2.0).
   It is running under Windows 10.
   CUDA 9.2 with cuDNN 7.
   
   ## Build info
   Compiler: Visual Studio 2017 with the latest update. The CUDA and cuDNN build flags are enabled.
   
   ## Minimum reproducible example
   Take one of the C++ examples and run two trainings in a row inside the same process: the first with a batch size that is far too high, which triggers a GPU memory error, and the second with an appropriate batch size. The second training also hits a memory error because the memory grabbed by the first one is never made available again.
   Everything has to stay in the same process; once the process is killed, the GPU memory is freed.
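   
   Below is a rough sketch of the kind of reproduction I mean, assuming the cpp-package header `mxnet-cpp/MxNetCpp.h` is on the include path. It replaces actual training with raw GPU NDArray allocations, which is enough to exercise the allocator; the shapes are placeholders and only need to be chosen so that the first run exhausts the GPU's memory while the second run alone would fit:
   
   ```cpp
   #include <exception>
   #include <iostream>
   #include <vector>
   #include "mxnet-cpp/MxNetCpp.h"
   
   using namespace mxnet::cpp;
   
   int main() {
     Context ctx = Context::gpu(0);
   
     // First "training": keep allocating moderately sized arrays until the
     // GPU runs out of memory (delay_alloc = false forces the allocation to
     // happen immediately). The arrays acquired before the failure are
     // destroyed during stack unwinding, but on the affected version their
     // memory never becomes reusable.
     try {
       std::vector<NDArray> grabbed;
       for (int i = 0; i < 1024; ++i) {
         // Roughly 1 GB per array with float32; adjust to your GPU size.
         grabbed.emplace_back(Shape(1 << 18, 1024), ctx, false);
       }
     } catch (const std::exception& e) {
       std::cout << "First run failed as expected: " << e.what() << std::endl;
     }
   
     // Second "training": a single, reasonably sized allocation in the same
     // process. This also fails with an out-of-memory error, even though the
     // first run's arrays are long gone.
     try {
       NDArray ok_size(Shape(1 << 18, 1024), ctx, false);
       std::cout << "Second run allocated fine" << std::endl;
     } catch (const std::exception& e) {
       std::cout << "Second run also failed: " << e.what() << std::endl;
     }
     return 0;
   }
   ```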
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services