Posted to dev@mxnet.apache.org by Leonard Lausen <le...@lausen.nl> on 2018/07/24 01:47:07 UTC

Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?

Currently the default kernel for the nn.Embedding backward pass is known to be
buggy on P3 instances and with CUDA 9.2 (the issue also occurs on other
instances with earlier CUDA versions, but less often).

https://github.com/apache/incubator-mxnet/issues/11314

There is currently an opt-in for using a bug-free kernel, but it is not the
default. However, the bug-free kernel is used by default for shapes smaller
than 16384.
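To make the opt-in mechanism concrete, here is an illustrative sketch of the dispatch decision described above. The 16384 threshold comes from this thread; the environment-variable name MXNET_FORCE_ADDTAKEGRAD is my assumption based on the linked PR, not a verified MXNet API, so treat it as a placeholder:

```python
import os

def use_bugfree_kernel(total_size):
    """Decide whether the bug-free backward kernel is taken.

    Illustrative sketch only: the threshold is from the discussion,
    and the environment-variable name is an assumption.
    """
    if total_size < 16384:
        # Small shapes already use the bug-free kernel by default.
        return True
    # Larger shapes use the fast-but-buggy kernel unless the user opts in.
    return os.environ.get("MXNET_FORCE_ADDTAKEGRAD", "0") == "1"
```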

Should MXNet ship the more efficient but buggy kernel in v1.3, or use the
correct but less efficient kernel by default? As MXNet v1.3 is likely to be
used heavily with CUDA 9.2, I believe the default behavior should be changed
to the bug-free but less efficient kernel. Correctness and a good user
experience should come first; users who want the faster but buggy backward
kernel can still opt in. Note that this only affects the backward pass.
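For readers unfamiliar with the operator, a plain-Python reference of what the backward pass computes (independent of which kernel is chosen): the gradient w.r.t. the embedding table is a scatter-add of each output-gradient row into the table row selected by the corresponding input index. The correctness question in this thread is only about how the CUDA kernels parallelize this accumulation.

```python
def embedding_grad(indices, grad_out, vocab_size, dim):
    # Reference backward pass for an embedding lookup: scatter-add
    # each output gradient row into the weight-gradient row selected
    # by the corresponding input index.
    grad_weight = [[0.0] * dim for _ in range(vocab_size)]
    for i, idx in enumerate(indices):
        for j in range(dim):
            grad_weight[idx][j] += grad_out[i][j]
    return grad_weight
```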

Hao did related work on improving the take operator
(https://github.com/apache/incubator-mxnet/pull/11326,
https://github.com/apache/incubator-mxnet/pull/11795), which also fixes the
issue, but he found it to be only "slightly faster" than the bug-free kernel
that is currently opt-in, while causing CI failures on Windows.

In my experience there is no speed difference between the current buggy kernel
and the opt-in bug-free kernel, but the GPU utilization of the latter is 100%,
compared to 60% for the former (benchmark script:
https://github.com/apache/incubator-mxnet/pull/11795#issuecomment-405808567 )
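The linked benchmark script is authoritative; purely to illustrate the methodology (wall-clock timing of repeated calls, taking the best run), a minimal stand-alone harness might look like the following. It times an arbitrary Python callable rather than the real MXNet kernels:

```python
import time

def bench(fn, *args, repeat=5):
    # Time several runs and report the best one; taking the minimum
    # filters out warm-up and scheduler noise (the same idea timeit uses).
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```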

Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?

Posted by sebastianb <se...@wolfram.com.INVALID>.
> As MXNet v1.3 is likely to be used a lot with Cuda 9.2 I believe the default behavior should be changed to use the bug-free but less efficient Kernel.


It would be crazy to do anything else, to be honest. It's a terrible philosophy to tell users 'you can't rely on MXNet to have correct behaviour on the fastest GPU; rather, you need to follow the forums and issue lists to know that you need to opt in to a bug-free implementation'.



Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?

Posted by Haibin Lin <ha...@gmail.com>.
Hi Hao,

Did you look at the AddTakeGrad for sparse gradients
(https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/indexing_op.cu#L77)?
If I'm not mistaken, Leonard doesn't see NaN values generated by the sparse
gradient kernel. The sparse kernel shares a similar parallelization strategy
with the dense AddTakeGradLargeBatch and can easily be adapted to replace the
dense kernel by removing the "lookup_table" argument of the kernel.
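To illustrate the distinction at stake, here is a sequential Python simulation of the two broad parallelization strategies for this accumulation. Which named kernel uses which strategy is my reading of the code linked above, so treat that mapping as an assumption:

```python
def grad_scatter(indices, grad_out, vocab_size, dim):
    # Strategy A: parallelize over input elements. On a GPU, two
    # threads handling the same index write to the same row, so this
    # shape of kernel needs atomic additions.
    grad = [[0.0] * dim for _ in range(vocab_size)]
    for i, idx in enumerate(indices):      # each i would be a thread
        for j in range(dim):
            grad[idx][j] += grad_out[i][j]
    return grad

def grad_gather(indices, grad_out, vocab_size, dim):
    # Strategy B: parallelize over output rows. Each row scans all
    # indices and accumulates only its own contributions, so no two
    # threads ever write the same location: no atomics, deterministic.
    grad = [[0.0] * dim for _ in range(vocab_size)]
    for row in range(vocab_size):          # each row would be a block
        for i, idx in enumerate(indices):
            if idx == row:
                for j in range(dim):
                    grad[row][j] += grad_out[i][j]
    return grad
```

Both produce the same result; they differ only in memory-access pattern and in whether concurrent writes can collide on the GPU.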

Best,
Haibin



Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?

Posted by Hao Jin <hj...@gmail.com>.
Hi all,
Some preliminary benchmark results have been shared on the related PR. Based
on a sample benchmark with an input on which the LargeBatch version is
supposed to perform better, we found no significant performance increase
compared with either the new general backward kernel or the AddTakeGrad
function, and the LargeBatch version is deemed buggy based on Leo's
reproduction example in the original issue. I propose that we delete the
LargeBatch version and use the AddTakeGrad version by default. If there is no
objection, we'll go ahead in that direction.
Hao


Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?

Posted by Naveen Swamy <mn...@gmail.com>.
If it is buggy, how does it matter if it is performant or not? I am not
seeing the rationale to make the correct version only opt-in.

