You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mxnet.apache.org by Leonard Lausen <l-...@lausen.nl> on 2018/08/16 01:50:43 UTC

Release blocker? - buggy topk Op

Recent changes in mxnet master introduced a bug into the topk operator.
 Below code example will output [ 274232. 179574. 274233. 274231.] with
 mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with
 mxnet-cu90==1.3.0b20180814. Likely #12085 is at fault.

See https://github.com/apache/incubator-mxnet/issues/12197 for more
info.

I think this should be considered a release blocker for the 1.3 release.

Note this breaks some parts of the KDD 18 MXNet / Gluon tutorial which
is scheduled for next Tuesday
http://www.kdd.org/kdd2018/hands-on-tutorials/view/mxnet-with-a-focus-on-nlp
. (We can work around by asking people to install the 0810 version
though.)


Re: Release blocker? - buggy topk Op

Posted by Roshani Nagmote <ro...@gmail.com>.
Thanks Leonard for raising this issue.
@ciyong Thanks for submitting the fix. I will be tracking mentioned PR
#12202 <https://github.com/apache/incubator-mxnet/pull/12202> for release.

-Roshani

On Thu, Aug 16, 2018 at 6:45 AM Zhao, Patric <pa...@intel.com> wrote:

> Hi Leonard,
>
> Thanks to raising the issue of topk op.
>
> The root cause is from the current API design which used float data type
> to represent the integer index, and as we know, the float type could NOT
> express the large integer precisely.
> (I have no offense. I know I missed some backgrounds and I think the
> current design is very good).
>
> The new CI#12085 changes the computation order and make this issue looks
> more significant. Essentially, the bug will happen when the index is large
> whatever with or without the new CI.
> One line example code can trigger the issue,
> 'print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'.
>
> Thus, the real fix is to change the API interface and use INT for the
> index. But it might introduce compatibility issue to current
> framework/topology due to API change.
> I am not sure we need to change in the last minutes of release 1.3
> (actually, we can contribute to it).
>
> Currently, we submitted a fix (#12202) to make the computation order as
> same as before and still much faster :)
>
> Apologies for the confusion and feel free to let us know for any feedback.
>
> Thanks,
>
> --Patric
>
>
> > -----Original Message-----
> > From: Leonard Lausen [mailto:l-software@lausen.nl]
> > Sent: Thursday, August 16, 2018 9:51 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Release blocker? - buggy topk Op
> >
> > Recent changes in mxnet master introduced a bug into the topk operator.
> >  Below code example will output [ 274232. 179574. 274233. 274231.] with
> >  mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with
> > mxnet-cu90==1.3.0b20180814. Likely #12085 is at fault.
> >
> > See https://github.com/apache/incubator-mxnet/issues/12197 for more
> info.
> >
> > I think this should be considered a release blocker for the 1.3 release.
> >
> > Note this breaks some parts of the KDD 18 MXNet / Gluon tutorial which is
> > scheduled for next Tuesday http://www.kdd.org/kdd2018/hands-on-
> > tutorials/view/mxnet-with-a-focus-on-nlp
> > . (We can work around by asking people to install the 0810 version
> > though.)
>
>

RE: Release blocker? - buggy topk Op

Posted by "Zhao, Patric" <pa...@intel.com>.
Hi Leonard,

Thanks to raising the issue of topk op.

The root cause is from the current API design which used float data type to represent the integer index, and as we know, the float type could NOT express the large integer precisely.
(I have no offense. I know I missed some backgrounds and I think the current design is very good).

The new CI#12085 changes the computation order and make this issue looks more significant. Essentially, the bug will happen when the index is large whatever with or without the new CI. 
One line example code can trigger the issue, 'print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'.

Thus, the real fix is to change the API interface and use INT for the index. But it might introduce compatibility issue to current framework/topology due to API change.
I am not sure we need to change in the last minutes of release 1.3 (actually, we can contribute to it).

Currently, we submitted a fix (#12202) to make the computation order as same as before and still much faster :)

Apologies for the confusion and feel free to let us know for any feedback.

Thanks,

--Patric


> -----Original Message-----
> From: Leonard Lausen [mailto:l-software@lausen.nl]
> Sent: Thursday, August 16, 2018 9:51 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Release blocker? - buggy topk Op
> 
> Recent changes in mxnet master introduced a bug into the topk operator.
>  Below code example will output [ 274232. 179574. 274233. 274231.] with
>  mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with
> mxnet-cu90==1.3.0b20180814. Likely #12085 is at fault.
> 
> See https://github.com/apache/incubator-mxnet/issues/12197 for more info.
> 
> I think this should be considered a release blocker for the 1.3 release.
> 
> Note this breaks some parts of the KDD 18 MXNet / Gluon tutorial which is
> scheduled for next Tuesday http://www.kdd.org/kdd2018/hands-on-
> tutorials/view/mxnet-with-a-focus-on-nlp
> . (We can work around by asking people to install the 0810 version
> though.)