Posted to dev@mxnet.apache.org by TongKe Xue <tk...@tkxue.org> on 2020/10/15 15:29:25 UTC

How does mxnet efficiently free GPU memory ?

Hi,

  In my very limited understanding:

  * GPU memory is often a bottleneck when training DL models
  * Java, not being RAII / refcounted, does not have predictable destructors
  * overloading math ops + auto diff often creates transient GPU tensors
that should later be freed

  Question: does mxnet have any automatic tracking of "this JVM object (1)
is no longer reachable and (2) holds a GPU tensor, so we should free it" ?

Thanks,
--TongKe
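
For reference, the kind of reachability-based tracking asked about above can be sketched in plain Java with java.lang.ref.Cleaner (JDK 9+). This is only an illustration of the JVM mechanism, not a description of what MXNet does: TrackedTensor, FreeAction and the liveBuffers counter are hypothetical stand-ins, and a real binding would call a native free instead of decrementing a counter.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

public class ReachabilityFreeSketch {
    // Stand-in for the number of live GPU allocations (assumption for the demo).
    static final AtomicInteger liveBuffers = new AtomicInteger();

    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical tensor wrapper; not an MXNet class.
    static final class TrackedTensor implements AutoCloseable {
        // The cleanup action must not hold a reference to the TrackedTensor,
        // otherwise the wrapper could never become unreachable.
        private static final class FreeAction implements Runnable {
            @Override
            public void run() {
                liveBuffers.decrementAndGet(); // a real binding would free native memory here
            }
        }

        private final Cleaner.Cleanable cleanable;

        TrackedTensor() {
            liveBuffers.incrementAndGet();
            // Runs FreeAction some time after this object becomes unreachable,
            // unless clean() has already been called explicitly.
            cleanable = CLEANER.register(this, new FreeAction());
        }

        // Deterministic path: free now. The action runs at most once.
        @Override
        public void close() {
            cleanable.clean();
        }
    }

    public static void main(String[] args) {
        TrackedTensor t = new TrackedTensor();
        System.out.println("live after alloc: " + liveBuffers.get());
        t.close();
        t.close(); // second call is a no-op
        System.out.println("live after close: " + liveBuffers.get());
    }
}
```

The catch is that the garbage collector only sees the small JVM wrapper, not the GPU buffer behind it, so the reachability path may run long after GPU memory has become scarce. That is why the explicit close() path still matters.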

Re: How does mxnet efficiently free GPU memory ?

Posted by TongKe Xue <tk...@tkxue.org>.
Hi Qing,

  I think I am misunderstanding something very basic. Perhaps we can work
through this example:

At
https://mxnet.apache.org/versions/1.7/api/java/docs/api/#org.apache.mxnet.javaapi.NDArray
we see a function of signature:

def add(other: NDArray): NDArray

Suppose we have (x: NDArray), (y: NDArray), (z: NDArray), all of the right
dimensions and GPU backed. Furthermore, suppose we do:

out = x * 2.0 + (y * 3.0) + z

My intuition is that this generates temporary values t1, t2, t3 where:

t1 = x * 2.0
t2 = y * 3.0
t3 = t1 + t2
out = t3 + z

However, I am not manually calling dispose on any of t1, t2, t3. Is this
resulting in a memory leak?

--TongKe
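
The temporaries in the expression above can be made visible with a small self-contained sketch. It does not use MXNet: FakeTensor is a hypothetical stand-in for a GPU-backed array that only counts how many instances are still open, and mul/add are illustrative method names. Under those assumptions, it contrasts the chained expression with a version that names t1, t2, t3 and closes them with try-with-resources.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TemporariesSketch {
    // Number of FakeTensor instances that have been created but not closed.
    static final AtomicInteger live = new AtomicInteger();

    // Hypothetical stand-in for a GPU-backed array; not the MXNet NDArray class.
    static final class FakeTensor implements AutoCloseable {
        final double value;
        private boolean closed = false;

        FakeTensor(double value) {
            this.value = value;
            live.incrementAndGet();
        }

        FakeTensor mul(double s) { return new FakeTensor(value * s); }

        FakeTensor add(FakeTensor other) { return new FakeTensor(value + other.value); }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                live.decrementAndGet();
            }
        }
    }

    // Chained form: t1, t2 and t3 are created, but nothing ever closes them.
    static FakeTensor chained(FakeTensor x, FakeTensor y, FakeTensor z) {
        return x.mul(2.0).add(y.mul(3.0)).add(z);
    }

    // Explicit form: each temporary is named and closed when the block exits.
    static FakeTensor scoped(FakeTensor x, FakeTensor y, FakeTensor z) {
        try (FakeTensor t1 = x.mul(2.0);
             FakeTensor t2 = y.mul(3.0);
             FakeTensor t3 = t1.add(t2)) {
            return t3.add(z); // the result is created before t1..t3 are closed
        }
    }

    public static void main(String[] args) {
        FakeTensor x = new FakeTensor(1.0);
        FakeTensor y = new FakeTensor(2.0);
        FakeTensor z = new FakeTensor(3.0);

        int before = live.get();
        chained(x, y, z);
        System.out.println("new live tensors after chained: " + (live.get() - before));

        before = live.get();
        scoped(x, y, z);
        System.out.println("new live tensors after scoped: " + (live.get() - before));
    }
}
```

In this sketch the chained call leaves four new tensors open (the three temporaries plus the result), while the scoped call leaves only the result open.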


On Thu, Oct 15, 2020 at 9:25 AM Qing Lan <la...@live.com> wrote:

> Hi Tongke,
>
> GPU memory usage sometimes grows very large and can easily exceed the GPU
> memory limit, so it requires more frequent GC.
>
> MXNet Java designed NDArray to be AutoCloseable, which allows you to have
> its memory released once you are done using it.
>
> MXNet (C API) has a reference-counting system underneath, but it cannot
> track whether a JVM object holds a piece of memory. You have to close the
> JVM object itself, which tells the Engine that the reference is no longer
> used so that this piece of memory can be cleaned up. So the answer is
> yes: you need to manage GPU NDArrays manually if you use them.
>
> Thanks,
> Qing
>

Re: How does mxnet efficiently free GPU memory ?

Posted by Qing Lan <la...@live.com>.
Hi Tongke,

GPU memory usage sometimes grows very large and can easily exceed the GPU memory limit, so it requires more frequent GC.

MXNet Java designed NDArray to be AutoCloseable, which allows you to have its memory released once you are done using it.

MXNet (C API) has a reference-counting system underneath, but it cannot track whether a JVM object holds a piece of memory. You have to close the JVM object itself, which tells the Engine that the reference is no longer used so that this piece of memory can be cleaned up. So the answer is yes: you need to manage GPU NDArrays manually if you use them.

Thanks,
Qing
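
One way to make the manual management described above less error-prone is to register every intermediate with a scope object and close them all at once. The following is a minimal sketch of that pattern using only the JDK: CloseScope and Buf are hypothetical helpers written for this example, not MXNet classes, and Buf merely logs the order in which it is closed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ScopeSketch {
    // Hypothetical helper: collects resources and closes them together,
    // most recently tracked first.
    static final class CloseScope implements AutoCloseable {
        private final Deque<AutoCloseable> tracked = new ArrayDeque<>();

        <T extends AutoCloseable> T track(T resource) {
            tracked.push(resource);
            return resource;
        }

        @Override
        public void close() {
            RuntimeException first = null;
            while (!tracked.isEmpty()) {
                try {
                    tracked.pop().close();
                } catch (Exception e) {
                    // Keep closing the rest; report the first failure at the end.
                    if (first == null) {
                        first = new RuntimeException(e);
                    } else {
                        first.addSuppressed(e);
                    }
                }
            }
            if (first != null) {
                throw first;
            }
        }
    }

    // Minimal resource that records when it is closed (stand-in for a GPU buffer).
    static final class Buf implements AutoCloseable {
        private final String name;
        private final StringBuilder log;

        Buf(String name, StringBuilder log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.append(name).append(' ');
        }
    }

    // Tracks three temporaries and returns the order in which they were closed.
    static String demo() {
        StringBuilder log = new StringBuilder();
        try (CloseScope scope = new CloseScope()) {
            scope.track(new Buf("t1", log));
            scope.track(new Buf("t2", log));
            scope.track(new Buf("t3", log));
        }
        return log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "t3 t2 t1"
    }
}
```

The scope closes resources in reverse order of registration, mirroring what try-with-resources does for individually declared variables. A result that must outlive the scope would simply not be tracked.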
