Posted to dev@mxnet.apache.org by Lai Wei <ro...@gmail.com> on 2019/06/04 07:14:09 UTC

Re: [DISCUSS] 1.5.0 Release Plan

Hi dev@,

Here are the updated release tracker and the timeline for 1.5.0 release:
https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status
We are working on the last 2 blockers, both nightly test failures [1][2], and would
like to fix them by 06/04 so we can start to tag 1.5.0.

For [1], I am in favor of disabling the test if we can't fix it in time,
similar to what we did for the CPU failure in [4].
Large tensor tests require a lot of memory, and we don't yet have a good
way to test them. [5]

For [2], it should be fixed by [3]

[1] large tensor nightly GPU failure:
https://github.com/apache/incubator-mxnet/issues/14981
[2] AMP tutorial nightly GPU failure:
https://github.com/apache/incubator-mxnet/issues/15028
[3] nightly fix: https://github.com/apache/incubator-mxnet/pull/15141
[4] large tensor nightly CPU:
https://github.com/apache/incubator-mxnet/issues/14980
[5] testability of large tensor discussion:
https://lists.apache.org/thread.html/d7f397c3f32196cb66ad9deae55dcf9a06dda56b37cbb0399ea1799f@%3Cdev.mxnet.apache.org%3E

Any help will be appreciated, thanks a lot!


Best Regards

Lai


On Fri, May 31, 2019 at 2:31 PM Haibin Lin <ha...@gmail.com> wrote:

> Hi dev@,
>
> Quick update on the gluonnlp issue. Lai and I worked together to test
> gluonnlp and MXNet with different configurations, and found that the use of
> the GELU operator in fp16 is causing the divergence. It was a very recent
> change in gluonnlp, and it can be avoided by reverting the change in
> GluonNLP. This doesn't block the 1.5 release anymore.
>
> Best,
> Haibin
>
> On Thu, May 30, 2019 at 11:33 AM Lai Wei <ro...@gmail.com> wrote:
>
> > Hi dev@,
> >
> > Quick update on the 1.5.0 release: all previously tracked PRs have been
> > merged and CI is back to normal again, so please rebase your PRs.
> > Again, I would like to encourage downstream projects to test against the
> > latest MXNet now to discover bugs and regressions early. I really
> > appreciate your help.
> >
> > We still have 3 new open issues/PRs to track:
> > 1. The GluonNLP BERT training issue Haibin mentioned
> > 2. https://github.com/apache/incubator-mxnet/pull/15039
> > 3. https://github.com/apache/incubator-mxnet/pull/15097
> >
> > Thanks!
> >
> > Best Regards
> >
> > Lai
> >
> >
> > On Tue, May 28, 2019 at 9:32 AM Haibin Lin <ha...@gmail.com>
> > wrote:
> >
> > > Hi dev@,
> > >
> > > I was testing GluonNLP with MXNet master, and found that BERT training
> > > crashes a few hours after I launch the job. I can confirm that MXNet
> > > pip package 20190412 works fine. I am bisecting changes in MXNet/GluonNLP
> > > to check what causes the problem. I'll send an update as soon as I find
> > > the root cause, or if I find any workaround.
> > >
> > > Thanks,
> > > Haibin
> > >
> > > On Thu, May 23, 2019 at 2:12 AM Lin Yuan <ap...@gmail.com> wrote:
> > >
> > > > Hi Lai,
> > > >
> > > > One important PR is currently blocked by a flaky TensorRT test:
> > > >
> > > > https://github.com/apache/incubator-mxnet/pull/15041
> > > >
> > > > I have retriggered it several times. If it fails again, I may need the
> > > > CI team to help disable this test. It has been reported by multiple people:
> > > > https://github.com/apache/incubator-mxnet/issues/14978
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > > On Wed, May 22, 2019 at 11:38 PM Zhao, Patric <patric.zhao@intel.com>
> > > > wrote:
> > > >
> > > > > Thanks, Lai.
> > > > >
> > > > > With great help from the community, all PRs listed in the roadmap
> > > > > are done :)
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/14619#issuecomment-480110642
> > > > >
> > > > > Status update for the list below:
> > > > >
> > > > >  - [1] PR#14713 is almost done and waiting for internal validation
> > > > >    results
> > > > >  - [2] PR#14893 is merged
> > > > >  - [3] PR#15031 is merged
> > > > >  - [7] PR#15038 is a new PR to fix a bug in the C++ interface; it
> > > > >    will be merged soon after review.
> > > > >
> > > > > Feel free to let me know if there is anything our team can help with :)
> > > > >
> > > > > BR,
> > > > >
> > > > > --Patric
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Lai Wei [mailto:royweilai@gmail.com]
> > > > > > Sent: Thursday, May 23, 2019 6:05 AM
> > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > Subject: Re: [DISCUSS] 1.5.0 Release Plan
> > > > > >
> > > > > > Hi @dev,
> > > > > >
> > > > > > Thanks for working hard on the 1.5 release. Since there have been
> > > > > > several release blockers (mostly fixed), we are extending the code
> > > > > > freeze to Friday 05/22/2019. Right now we are tracking the following
> > > > > > 5 open PRs [1][2][3][4][5] and 1 issue [6]. Please let us know if you
> > > > > > need more time.
> > > > > >
> > > > > > I would like to encourage all downstream projects to test with the
> > > > > > latest MXNet to avoid any incompatibility in the coming 1.5.0
> > > > > > release. If you have any issues that may block the release, please
> > > > > > let us know.
> > > > > > Thank you very much.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14713
> > > > > > [2] https://github.com/apache/incubator-mxnet/pull/14893
> > > > > > [3] https://github.com/apache/incubator-mxnet/pull/15031
> > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15039
> > > > > > [5] https://github.com/apache/incubator-mxnet/pull/15041
> > > > > > [6] https://github.com/apache/incubator-mxnet/issues/15034
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Lai
> > > > > >
> > > > > >
> > > > > > On Wed, May 15, 2019 at 9:05 PM Junru Shao <junrushao1994@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > I may have a release blocker for 1.5.0: the implementation of the
> > > > > > > dynamic shape mechanism somehow conflicts with Gluon's deferred
> > > > > > > initialization [1].
> > > > > > >
> > > > > > > [1] https://github.com/dmlc/gluon-nlp/issues/706
> > > > > > >
> > > > > > > On Wed, May 15, 2019 at 12:09 PM Anirudh Subramanian <
> > > > > > > anirudh2290@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Lai,
> > > > > > > >
> > > > > > > > From the discussion I had with Nvidia offline, they are
> > > > > > > > aiming to push the required changes today.
> > > > > > > > Since this is an important feature for the release, if this gets
> > > > > > > > delayed and cannot be merged by 05/17/2019, the code freeze
> > > > > > > > date may need to be changed.
> > > > > > > >
> > > > > > > > Anirudh
> > > > > > > >
> > > > > > > > On Wed, May 15, 2019 at 1:23 AM Lv, Tao A <tao.a.lv@intel.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi dev,
> > > > > > > > >
> > > > > > > > > We see there are several GitHub issues [1][2][3][4] about the
> > > > > > > > > MXNet Windows build experience. The team is working intensively
> > > > > > > > > [5][6][7] to fix some problems with the MKL-DNN build on
> > > > > > > > > Windows. We hope these fixes can make the code freeze and
> > > > > > > > > enter the 1.5.0 release.
> > > > > > > > >
> > > > > > > > > The PR against mshadow (#374) was already merged, and MXNet PR
> > > > > > > > > #14877 is under review - many thanks to the CI team for helping
> > > > > > > > > with the MKL installation request. PR #14952 is a documentation
> > > > > > > > > change that follows the build logic changes in PR #14877, so I
> > > > > > > > > think these two PRs should be merged simultaneously.
> > > > > > > > > Currently #14877 is experiencing a CI response problem.
> > > > > > > > >
> > > > > > > > > Please take some time to look at these two PRs. Your comments
> > > > > > > > > and suggestions are highly appreciated.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > -tao
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/incubator-mxnet/issues/14670
> > > > > > > > > [2] https://github.com/apache/incubator-mxnet/issues/14335
> > > > > > > > > [3] https://github.com/apache/incubator-mxnet/issues/14203
> > > > > > > > > [4] https://github.com/apache/incubator-mxnet/issues/14085
> > > > > > > > > [5] https://github.com/apache/incubator-mxnet/pull/14877
> > > > > > > > > [6] https://github.com/dmlc/mshadow/pull/374
> > > > > > > > > [7] https://github.com/apache/incubator-mxnet/pull/14952
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Lai Wei [mailto:royweilai@gmail.com]
> > > > > > > > > Sent: Wednesday, May 15, 2019 2:57 PM
> > > > > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > > > > Subject: Re: [DISCUSS] 1.5.0 Release Plan
> > > > > > > > >
> > > > > > > > > Hi Anirudh,
> > > > > > > > >
> > > > > > > > > I see there was an offline discussion
> > > > > > > > > <https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341>
> > > > > > > > > and I have updated the AMP feature and your project on the
> > > > > > > > > release tracker
> > > > > > > > > <https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status>.
> > > > > > > > > Please let me know if you have any updates.
> > > > > > > > >
> > > > > > > > > Hi @dev,
> > > > > > > > > This is a gentle reminder that the code freeze for the 1.5.0
> > > > > > > > > release is on 05/17/2019. Please let us know if you have any
> > > > > > > > > WIP pull requests aiming for 1.5.0 that need attention.
> > > > > > > > > Please understand we already have around 650 commits in master
> > > > > > > > > that need to be released in time. We understand the TensorRT
> > > > > > > > > test in CI is failing and are trying to fix it. Meanwhile,
> > > > > > > > > please update the tracker if there is any change:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Lai
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, May 8, 2019 at 11:58 AM Anirudh Subramanian <anirudh2290@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Sheng,
> > > > > > > > > >
> > > > > > > > > > I had a discussion with Nvidia folks offline today (@ptrendx
> > > > > > > > > > et al.). I strongly feel that the AMP feature should be
> > > > > > > > > > included as part of the release:
> > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > > > > > The PR is aimed for completion next week, but reviews and RFC
> > > > > > > > > > discussions may take some time. I would like to request
> > > > > > > > > > extending the release code freeze by 2 weeks.
> > > > > > > > > > Also, I would like to include
> > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/Conversion+from+FP32+to+Mixed+Precision+Models
> > > > > > > > > > which depends on the AMP PR.
> > > > > > > > > > I am also aiming to add a PR by the end of this week or early
> > > > > > > > > > next week, but reviews will take longer than May 17th.
> > > > > > > > > >
> > > > > > > > > > Anirudh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, May 6, 2019 at 11:49 PM Sheng Zha <szha.pvg@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > While the 1.4.1 vote on general@incubator is still ongoing,
> > > > > > > > > > > I’d like to propose that we start preparing the 1.5.0 release.
> > > > > > > > > > >
> > > > > > > > > > > 1.5.0 will include changes that date back to last year, and
> > > > > > > > > > > there have been a lot of new features and improvements in it,
> > > > > > > > > > > so it will likely take us more time to prepare than 1.4.1. I
> > > > > > > > > > > propose the following timeline:
> > > > > > > > > > > - Cut release branch: release branch already cut. Will sync
> > > > > > > > > > > with master branch on 5/15/2019 EOD.
> > > > > > > > > > > - Code freeze: 5/17/2019. No more changes unless the release
> > > > > > > > > > > branch is in a broken state.
> > > > > > > > > > > - Tag and vote: 5/20/2019 onward.
> > > > > > > > > > >
> > > > > > > > > > > Lai Wei (roywei@) expressed to me offline that he’s willing
> > > > > > > > > > > to help drive this release as release manager, and I’m happy
> > > > > > > > > > > to help again as committer.
> > > > > > > > > > >
> > > > > > > > > > > If you have features in progress that you’d like to include
> > > > > > > > > > > in 1.5.0:
> > > > > > > > > > > - Add your feature to the scope:
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status
> > > > > > > > > > > - Indicate in this thread:
> > > > > > > > > > >   - how confident you are about making it happen before the
> > > > > > > > > > >     code freeze. If not confident, provide an estimate for a
> > > > > > > > > > >     more manageable code freeze date so that people can discuss
> > > > > > > > > > >     whether to extend the deadline or to skip one release for it.
> > > > > > > > > > >   - whether your PR requires more attention to make it happen.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your attention. Comments and suggestions are also
> > > > > > > > > > > welcome.
> > > > > > > > > > >
> > > > > > > > > > > -sz
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>