Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2022/03/26 01:30:04 UTC

[GitHub] [incubator-mxnet] waytrue17 commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

waytrue17 commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1079551643


   It looks like the memory leak in the above script occurs because we instantiate a new dataloader object in each iteration of the for loop. Reusing a single dataloader object seems to mitigate the issue:
   ```
   import mxnet.gluon as gl
   import mxnet as mx
   import gc
   
   if __name__ == "__main__":
       gpu_ctx = mx.gpu()
       model = gl.nn.Embedding(10, 5)
       model.initialize(ctx=gpu_ctx)
       X = mx.random.uniform(shape=(1000, 3))
       dataset = mx.gluon.data.dataset.ArrayDataset(X)
       num_workers = 8
        data_loader = gl.data.DataLoader(
            dataset,
            batch_size=1,
            num_workers=num_workers,
        )
   
       for epoch in range(5):
           for batch in data_loader:
               # move data to gpu
                data_gpu = batch.copyto(gpu_ctx)
               # forward
               l = model(data_gpu)
               # force immediate compute
               l.asnumpy()
   
           mx.nd.waitall()
   
           a, b = mx.context.gpu_memory_info(0)
           print(f"num_workers: {num_workers} epoch {epoch}: "
                 f"current gpu memory {(b - a) / (1024 * 1024 * 1024)} GB, "
                 f"Total gpu memory {b / (1024 * 1024 * 1024)} GB.")
           data_loader.refresh()
   ```
   
   ```
   num_workers: 8 epoch 0: current gpu memory 1.43017578125 GB, Total gpu memory 15.78192138671875 GB.
   num_workers: 8 epoch 1: current gpu memory 1.43017578125 GB, Total gpu memory 15.78192138671875 GB.
   num_workers: 8 epoch 2: current gpu memory 1.43017578125 GB, Total gpu memory 15.78192138671875 GB.
   num_workers: 8 epoch 3: current gpu memory 1.43017578125 GB, Total gpu memory 15.78192138671875 GB.
   num_workers: 8 epoch 4: current gpu memory 1.43017578125 GB, Total gpu memory 15.78192138671875 GB.
   ```
   It seems that previously we had `mshadow::DeleteStream<gpu>(stream)` to clean up GPU memory tied to the life cycle of the dataloader object, but it caused a [segfault issue](https://github.com/apache/incubator-mxnet/issues/19360). In the workaround [PR](https://github.com/apache/incubator-mxnet/pull/19378), we removed `mshadow::DeleteStream<gpu>(stream)` and relied on the OS to clean up the memory at the end of the program. That may explain why we see a memory leak when creating multiple dataloaders in one program.
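   The lifecycle issue can be sketched without MXNet at all. In the sketch below, `LeakyLoader` and `allocated_streams` are hypothetical stand-ins (not MXNet APIs) for a dataloader whose stream cleanup was removed: each new object acquires a resource that is only reclaimed at process exit, so constructing one per epoch accumulates leaked resources.
   ```
   # Illustrative sketch, plain Python. LeakyLoader and allocated_streams are
   # hypothetical stand-ins for a dataloader and its GPU streams; this is not
   # MXNet code.

   allocated_streams = []  # stands in for GPU streams held by worker processes

   class LeakyLoader:
       """Stand-in for a dataloader whose stream cleanup was removed."""
       def __init__(self):
           # acquire a "stream"; with no __del__ cleanup, it is never released
           # until the process exits
           allocated_streams.append(object())

   # Creating a loader per epoch (the pattern in the original issue) leaks one
   # "stream" per iteration; reusing a single loader would allocate only once.
   for epoch in range(5):
       loader = LeakyLoader()

   print(len(allocated_streams))  # grows with each new loader
   ```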


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


