Posted to dev@mxnet.apache.org by Lai Wei <ro...@gmail.com> on 2019/06/08 20:11:50 UTC

[VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Dear MXNet community,

This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
Voting on dev@ will start June 8, 23:59:59 (PST) and close on June 11,
23:59:59.

1) Link to release notes:
https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes

2) Link to release candidate:

https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0

3) Link to source and signatures on apache dist server:

https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/


Please remember to TEST first before voting accordingly:
+1 = approve
+0 = no opinion
-1 = disapprove (provide reason)


Best Regards

Lai

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
Correction, I wanted to say:

1.5 is 33% faster than 1.4.1 when using hybridize without static_alloc
and static_shape.

We are claiming that static_alloc should improve speed and in this
case it makes it worse. Is that a blocker for the release?

Pedro.
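
For anyone who wants to reproduce this kind of comparison, here is a
minimal timing sketch. It uses only the Python standard library. The two
forward functions are hypothetical stand-ins, not the model discussed in
this thread. In a real run they would wrap one forward pass of a net
hybridized with and without static_alloc and static_shape, and should
block until the result is ready (for example with mx.nd.waitall()), since
MXNet executes asynchronously. "X% faster" is taken here to mean a
throughput ratio, which is an assumption about how the numbers above were
computed.

```python
import time


def mean_latency_ms(fn, warmup=3, runs=20):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


def percent_faster(baseline_ms, candidate_ms):
    """How much faster the candidate is than the baseline, in percent.

    Interpreted as a throughput ratio: baseline_ms / candidate_ms - 1.
    A negative result means the candidate is slower.
    """
    return (baseline_ms / candidate_ms - 1.0) * 100.0


# Hypothetical stand-ins for the two forward passes being compared.
def forward_plain():
    sum(i * i for i in range(20000))


def forward_static():
    sum(i * i for i in range(20000))


plain_ms = mean_latency_ms(forward_plain)
static_ms = mean_latency_ms(forward_static)
print(f"static vs plain: {percent_faster(plain_ms, static_ms):+.1f}%")
```

Running the same script against both releases gives the four numbers
needed to compare 1.4.1 and 1.5 with and without the static options.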

On Tue, Jun 11, 2019 at 10:36 AM Pedro Larroy
<pe...@gmail.com> wrote:
>
> A bit more background into this:
>
> While tuning a model using LSTM and convolutions we find that using
> hybridize with static_alloc and static_shape is 15% slower in the
> latest revision vs in version 1.4.1 in which using hybridize with
> static_alloc and static_shape is 10% faster than without.
>
> Overall we are still 33% faster when comparing master to 1.5.
>
> Let me know if you think this is a release blocker or not.
>
> Pedro.
>
> On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> <pe...@gmail.com> wrote:
> >
> > -1
> >
> > We found a performance regression vs 1.4 related to CachedOp which
> > affects Hybrid forward, which we are looking into.
> >
> > Pedro.
> >
> > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
> > >
> > > -1 (Tentatively until resolved)
> > >
> > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > the following error:
> > >
> > > Reproduce:
> > > 1) cp make/config.mk .
> > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > 3) make -j
> > >
> > > MXNet can build successfully.
> > >
> > > 4) pip install horovod
> > >
> > >
> > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > fatal error: mkldnn_version.h: No such file or directory
> > >     compilation terminated.
> > >     INFO: Unable to build MXNet plugin, will skip it.
> > >
> > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
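
A quick way to check a source tree for the header named in the error
above is sketched below. It assumes the compiler looks for
mkldnn_version.h next to mkldnn.h under include/mkldnn, which is inferred
from the path in the error message and not confirmed. The function name
is made up for illustration.

```python
from pathlib import Path


def missing_mkldnn_headers(mxnet_root, names=("mkldnn.h", "mkldnn_version.h")):
    """Return the header names that are absent from <root>/include/mkldnn."""
    include_dir = Path(mxnet_root) / "include" / "mkldnn"
    return [name for name in names if not (include_dir / name).is_file()]


# Example call (the path is the checkout location from the error message):
# print(missing_mkldnn_headers("/home/ubuntu/src/incubator-mxnet"))
```

An empty list means both headers are present in that directory.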
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> > >
> > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > 1.5.0.
> > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > > 23:59:59.
> > > > >
> > > > > 1) Link to release notes:
> > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > >
> > > > > 2) Link to release candidate:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > >
> > > > > 3) Link to source and signatures on apache dist server:
> > > > >
> > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > >
> > > > >
> > > > > Please remember to TEST first before voting accordingly:
> > > > > +1 = approve
> > > > > +0 = no opinion
> > > > > -1 = disapprove (provide reason)
> > > > >
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Aaron Markham <aa...@gmail.com>.
-1
There's an autogenerated file that doesn't get cleaned up in the
scala-package folder when you run make clean. This causes the scaladoc
step to fail. I'm putting in workaround messaging in the error message
and that'll go into master, but if anyone wants to specifically run
the scaladocs for 1.5.x, they're going to have a hard time. The
current error messaging is not helpful at all. You can get around it
by cloning fresh, which means no previously created files are in
there, but this isn't ideal for someone that has already been using
the repo and has scripts and other utilities all dialed in.
Zack's already working on a fix for this issue. If we're putting out
another RC anyway, then I'd vote to cherrypick Zack's fix so that docs
building works well.

Cheers,
Aaron
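
A rough way to see what a fresh clone would not contain is to list the
files git considers untracked or ignored under scala-package. This sketch
assumes the leftover autogenerated file falls into one of those two
groups, which has not been verified here. The helper names are made up
for illustration.

```python
import subprocess


def leftover_files(porcelain_output):
    """Return paths that git reports as untracked ('??') or ignored ('!!')."""
    paths = []
    for line in porcelain_output.splitlines():
        if line.startswith(("?? ", "!! ")):
            paths.append(line[3:])
    return paths


def scan(repo_dir, subdir="scala-package"):
    """Run git status in repo_dir and list leftover files under subdir."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--ignored", "--", subdir],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return leftover_files(result.stdout)


# Example call from the root of an existing checkout:
# for path in scan("."):
#     print(path)
```

Anything listed is a candidate for manual removal before building the
docs again.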

On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <ro...@gmail.com> wrote:
>
> Hi guys,
>
> Thanks for the updates. Currently, we are able to confirm Lin's issue with
> Horovod, and there is a fix pending. [1]
> Will update later today to see if we need to cancel this vote for the fix.
>
> As for the hybridize with static alloc performance regression, IMO it does
> not need to be a blocker if we have the following speed order:
> 1.5.0 w/o static > 1.5.0 w/ static  > 1.4.1 w/ static > 1.4.1 w/o static
> and it will be great to know the following to better make a decision on
> whether this should block the release.
> 1) if this is a model specific or a general regression.
> 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
> MKLDNN)
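
The speed order above can be checked mechanically once the four numbers
are known. The sketch below plugs in relative throughputs derived from
the percentages reported earlier in this thread, normalised so that 1.4.1
without static is 100. Reading "15% slower" and "33% faster" as
throughput ratios is an assumption, and these are not new measurements.

```python
# Speed order proposed above, fastest first.
EXPECTED_ORDER = [
    "1.5.0 w/o static",
    "1.5.0 w/ static",
    "1.4.1 w/ static",
    "1.4.1 w/o static",
]


def matches_speed_order(throughput, order=EXPECTED_ORDER):
    """True if throughput strictly decreases along the given order."""
    values = [throughput[name] for name in order]
    return all(a > b for a, b in zip(values, values[1:]))


# Relative throughput derived from the thread (assumption: percentages
# are throughput ratios; 1.4.1 without static is the baseline).
baseline = 100.0
reported = {
    "1.4.1 w/o static": baseline,
    "1.4.1 w/ static": baseline * 1.10,         # 10% faster than without
    "1.5.0 w/o static": baseline * 1.33,        # 33% faster than 1.4.1 w/o static
    "1.5.0 w/ static": baseline * 1.10 * 0.85,  # 15% slower than 1.4.1 w/ static
}

print(matches_speed_order(reported))
```

Under that reading the check fails, because 1.5.0 with static lands below
1.4.1 with static.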
>
>
> [1]https://github.com/apache/incubator-mxnet/pull/15213
>
>
> Thanks
>
> Best Regards
>
> Lai
>
>
> On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zh...@apache.org> wrote:
>
> >
> >
> > On 2019/06/11 18:53:56, Pedro Larroy <pe...@gmail.com>
> > wrote:
> > > The stack trace doesn't seem to come from MXNet, do you have more info?
> > >
> > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zh...@apache.org> wrote:
> > > >
> > > >
> > > >
> > > > On 2019/06/11 17:36:09, Pedro Larroy <pe...@gmail.com>
> > wrote:
> > > > > A bit more background into this:
> > > > >
> > > > > While tuning a model using LSTM and convolutions we find that using
> > > > > hybridize with static_alloc and static_shape is 15% slower in the
> > > > > latest revision vs in version 1.4.1 in which using hybridize with
> > > > > static_alloc and static_shape is 10% faster than without.
> > > > >
> > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > >
> > > > > Let me know if you think this is a release blocker or not.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > <pe...@gmail.com> wrote:
> > > > > >
> > > > > > -1
> > > > > >
> > > > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > > > affects Hybrid forward, which we are looking into.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > -1 (Tentatively until resolved)
> > > > > > >
> > > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod
> > but got
> > > > > > > the following error:
> > > > > > >
> > > > > > > Reproduce:
> > > > > > > 1) cp make/config.mk .
> > > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > > 3) make -j
> > > > > > >
> > > > > > > MXNet can build successfully.
> > > > > > >
> > > > > > > 4) pip install horovod
> > > > > > >
> > > > > > >
> > > > > > >
> > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > > >     compilation terminated.
> > > > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > > > >
> > > > > > > I did not change any setting of MKLDNN in my config.mk. I am
> > building on
> > > > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Lin
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Dear MXNet community,
> > > > > > > > >
> > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > version
> > > > > > > > 1.5.0.
> > > > > > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close
> > on June 11,
> > > > > > > > > 23:59:59.
> > > > > > > > >
> > > > > > > > > 1) Link to release notes:
> > > > > > > > >
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > >
> > > > > > > > > 2) Link to release candidate:
> > > > > > > > >
> > > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > >
> > > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > > >
> > > > > > > > >
> > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > +1 = approve
> > > > > > > > > +0 = no opinion
> > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best Regards
> > > > > > > > >
> > > > > > > > > Lai
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > > > -1. Built from source; import mxnet in Python causes a segfault.
> > > >
> > > > back trace:
> > > >
> > > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > > 0x00007fff3e8a9f20 in ?? ()
> > > > (gdb) bt
> > > > #0  0x00007fff3e8a9f20 in ?? ()
> > > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> > from
> > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> > from
> > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/
> > > > apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > #7  0x000000000053fc97 in ?? ()
> > > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > > #9  0x000000000054a328 in ?? ()
> > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > #12 0x000000000053fc97 in ?? ()
> > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > #18 0x00000000004ec2e3 in ?? ()
> > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > >
> > > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with
> > USE_CUDA=1,
> > > > USE_CUDNN=1, the rest are default values.
> > > >
> > > > -Zhi
> > >
> >
> > Change to +1. I figured out that it was due to the dependencies. I still
> > have an issue using the DL base AMI with python3, but I will not regard it as a
> > blocker to the 1.5 release.
> > Tested Gluon-CV training and it works fine.
> >
> > -Zhi
> >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Sheng Zha <sz...@gmail.com>.
This vote has been closed. We will make another tag and start the vote again.

-sz

> On Jun 18, 2019, at 5:24 PM, Lin Yuan <ap...@gmail.com> wrote:
> 
> With the PR https://github.com/apache/incubator-mxnet/pull/15213 I could
> verify that building Horovod is successful with MXNet built from source. So
> I will remove my previous -1 vote.
> 
> Best,
> 
> Lin

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Lin Yuan <ap...@gmail.com>.
With the PR https://github.com/apache/incubator-mxnet/pull/15213 I could
verify that building Horovod is successful with MXNet built from source. So
I will remove my previous -1 vote.

Best,

Lin

On Tue, Jun 18, 2019 at 2:10 PM Junru Shao <ju...@gmail.com> wrote:

> Dear community,
>
> I am happy to share some results with regard to commit 83d2c2d0e (PR
> #14192, link: https://github.com/apache/incubator-mxnet/pull/14192), which
> Pedro mentioned as causing a regression.
>
> First, using the exact model that Pedro provides, we did rigorous profiling
> and found out that the PR #14192 slows it down by 7.26 ms (from 235.65 ms
> to 242.91 ms).
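
To put the figures above in relative terms, the arithmetic can be written
out as follows. The two latencies are taken directly from the message and
nothing else is assumed.

```python
before_ms = 235.65  # latency before PR #14192, as reported above
after_ms = 242.91   # latency with PR #14192, as reported above

delta_ms = after_ms - before_ms
relative = delta_ms / before_ms * 100.0

print(f"absolute slowdown: {delta_ms:.2f} ms")
print(f"relative slowdown: {relative:.2f}%")
```

That is a slowdown of 7.26 ms, or roughly 3.08% of the original latency.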
>
> Then, we submitted a follow-up PR #15262 (link:
> https://github.com/apache/incubator-mxnet/pull/15262) to fix the
> regression. By applying the patch to commit 83d2c2d0e, we could verify that
> we get comparable performance. Please refer to the PR if you are interested
> in our experiment.
>
> That is to say, the regression caused by commit 83d2c2d0e should have been
> addressed. Please let me know if there are any further issues.
>
> Thank you so much,
> Junru
>
> On Thu, Jun 13, 2019 at 3:05 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > I reached out to you in private; the model is not public. We should be able to
> > see this problem in a public model using LSTM, I think.
> >
> >
> > On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <ju...@gmail.com>
> > wrote:
> > >
> > > Hi Pedro,
> > >
> > > Thanks for bringing this up!
> > >
> > > Could you provide your model so that we can dig into this?
> > >
> > > Thanks,
> > > Junru
> > >
> > > On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > > wrote:
> > >
> > > > I have isolated some of the commits that are causing performance
> > > > regressions in WaveNet-like models:
> > > >
> > > > Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
> > > > (#14192)
> > > >
> > > > Causes a regression making hybridize with static slower using GPU
> > > > inference.
> > > >
> > > > [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
> > > > int64 as tensor size (#14570)
> > > >
> > > > Causes overall regressions in CPU inference.
> > > >
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <ro...@gmail.com>
> wrote:
> > > > >
> > > > > Hi @dev,
> > > > >
> > > > > > I am canceling the vote as the issue Lin discovered requires a fix [1]
> > > > > > and the solution is not ready yet.
> > > > > > It's a general problem when building from source with MXNet, not only
> > > > > > impacting horovod use cases.  Any help is appreciated.
> > > > > >
> > > > > > Other issues we are tracking:
> > > > > > 1. Regression on hybridize with static_alloc. (not a blocker for now)
> > > > > > 2. Scala doc issue [2], already merged in master, need to backport to
> > > > > > 1.5.x
> > > > > >
> > > > > > Thanks for everyone's help! Please let us know if there is any other
> > > > > > issue with 1.5.0
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > > > > [2] https://github.com/apache/incubator-mxnet/pull/15216
> > > > >
> > > > >
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > > >
> > > > > On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Tested with CPU, 2.6x slower. comparing master vs 1.4.1.
> > > > > >
> > > > > > Looks like a general regression.
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <ro...@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > > Thanks for the updates. Currently, we are able to confirm Lin's issue with
> > > > > > > > Horovod, and there is a fix pending. [1]
> > > > > > > > Will update later today to see if we need to cancel this vote for the fix.
> > > > > > > >
> > > > > > > > As for the hybridize with static alloc performance regression, IMO it does
> > > > > > > > not need to be a blocker if we have the following speed order:
> > > > > > > > 1.5.0 w/o static > 1.5.0 w/ static  > 1.4.1 w/ static > 1.4.1 w/o static
> > > > > > > > and it will be great to know the following to better make a decision on
> > > > > > > > whether this should block the release.
> > > > > > > > 1) if this is a model specific or a general regression.
> > > > > > > > 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
> > > > > > > > MKLDNN)
> > > > > > >
> > > > > > >
> > > > > > > [1]https://github.com/apache/incubator-mxnet/pull/15213
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Best Regards
> > > > > > >
> > > > > > > Lai
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <
> zhreshold@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2019/06/11 18:53:56, Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > > >
> > > > > > > > wrote:
> > > > > > > > > The stack trace doesn't seem to come from MXNet, do you
> have
> > more
> > > > > > info?
> > > > > > > > >
> > > > > > > > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <
> > zhreshold@apache.org
> > > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 2019/06/11 17:36:09, Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > A bit more background into this:
> > > > > > > > > > >
> > > > > > > > > > > > While tuning a model using LSTM and convolutions we find that using
> > > > > > > > > > > > hybridize with static_alloc and static_shape is 15% slower in the
> > > > > > > > > > > > latest revision vs in version 1.4.1 in which using hybridize with
> > > > > > > > > > > > static_alloc and static_shape is 10% faster than without.
> > > > > > > > > > > >
> > > > > > > > > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > > > > > > > > >
> > > > > > > > > > > > Let me know if you think this is a release blocker or not.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > > > > > > > <pe...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > -1
> > > > > > > > > > > >
> > > > > > > > > > > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > > > > > > > > > > affects Hybrid forward, which we are looking into.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <
> > > > apeforest@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -1 (Tentatively until resolved)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > > > > > > > > > > > > the following error:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reproduce:
> > > > > > > > > > > > > 1) cp make/config.mk .
> > > > > > > > > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > > > > > > > > 3) make -j
> > > > > > > > > > > > >
> > > > > > > > > > > > > MXNet can build successfully.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4) pip install horovod
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
> /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > > > > > > > fatal error: mkldnn_version.h: No such file or
> > directory
> > > > > > > > > > > > >     compilation terminated.
> > > > > > > > > > > > >     INFO: Unable to build MXNet plugin, will skip
> it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I did not change any setting of MKLDNN in my
> > config.mk.
> > > > I am
> > > > > > > > building on
> > > > > > > > > > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lin
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <
> > > > > > yajiedesign@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet
> > > > > > (incubating)
> > > > > > > > version
> > > > > > > > > > > > > > 1.5.0.
> > > > > > > > > > > > > > > Voting on dev@ will start June 8,
> > 23:59:59(PST)  and
> > > > > > close
> > > > > > > > on June 11,
> > > > > > > > > > > > > > > 23:59:59.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > > > >
> > > > > > > >
> > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3) Link to source and signatures on apache dist
> > > > server:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > >
> > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please remember to TEST first before voting
> > > > accordingly:
> > > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lai
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -1. Built from source, import mxnet in python cause
> > Segfault.
> > > > > > > > > >
> > > > > > > > > > back trace:
> > > > > > > > > >
> > > > > > > > > > Thread 1 "python3" received signal SIGSEGV, Segmentation
> > fault.
> > > > > > > > > > 0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > > (gdb) bt
> > > > > > > > > > #0  0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > > > > std::allocator<char> > const&, bool const&, unsigned int
> > > > const&) ()
> > > > > > > > from
> > > > > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > > > > std::allocator<char> > const&, bool const&, unsigned int
> > > > const&) ()
> > > > > > > > from
> > > > > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&)
> ()
> > from
> > > > > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #4  0x00007ffff29d5c48 in ?? () from
> > > > > > /usr/lib/python3/dist-packages/
> > > > > > > > > > apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > > > > > > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > > > > > > > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > > > > > > > #7  0x000000000053fc97 in ?? ()
> > > > > > > > > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > > > > > > > > #9  0x000000000054a328 in ?? ()
> > > > > > > > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > > > > > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > > > > > > > #12 0x000000000053fc97 in ?? ()
> > > > > > > > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > > > > > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > > > > > > > #18 0x00000000004ec2e3 in ?? ()
> > > > > > > > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > > > > > > > >
> > > > > > > > > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built
> > with
> > > > > > > > USE_CUDA=1,
> > > > > > > > > > USE_CUDNN=1, the rest are default values.
> > > > > > > > > >
> > > > > > > > > > -Zhi
> > > > > > > > >
> > > > > > > >
> > > > > > > > Change to +1, I figured out that it was due to the
> > dependencies. I
> > > > > > still
> > > > > > > > have issue using DL base AMI with python3, but I will not
> > regard
> > > > it as
> > > > > > a
> > > > > > > > blocker to 1.5 release.
> > > > > > > > Tested Gluon-CV training and works fine.
> > > > > > > >
> > > > > > > > -Zhi
> > > > > > > >
> > > > > >
> > > >
> >
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Junru Shao <ju...@gmail.com>.
Dear community,

I am happy to share some results regarding commit 83d2c2d0e (PR
#14192, link: https://github.com/apache/incubator-mxnet/pull/14192), which
Pedro identified as causing a regression.

First, using the exact model that Pedro provided, we profiled it rigorously
and found that PR #14192 slows it down by 7.26 ms (from 235.65 ms
to 242.91 ms).
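
As a quick sanity check on those figures, the relative slowdown can be
worked out directly. This is only an illustrative calculation: the variable
names are made up here, and the two latencies are the ones quoted above.

```python
# Latencies quoted above for the model before and after PR #14192 (in ms).
before_ms = 235.65
after_ms = 242.91

# Absolute and relative slowdown introduced by the PR.
slowdown_ms = after_ms - before_ms
slowdown_pct = 100.0 * slowdown_ms / before_ms

print(f"slowdown: {slowdown_ms:.2f} ms ({slowdown_pct:.2f}%)")
# prints "slowdown: 7.26 ms (3.08%)"
```

So the measured regression on this model is roughly 3% of the per-run latency.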

Then, we submitted a follow-up PR #15262 (link:
https://github.com/apache/incubator-mxnet/pull/15262) to fix the
regression. After applying the patch on top of commit 83d2c2d0e, we could
verify that performance is comparable again. Please refer to the PR if you
are interested in our experiment.
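
For anyone who wants to repeat a comparison of this kind, here is a minimal
sketch of a timing harness. It is not the script used in the PR. The helper
names (median_latency_ms, relative_change_pct) and the two forward_*
functions are hypothetical stand-ins for one forward pass of a model on each
build.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, repeats=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost is not measured.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def relative_change_pct(baseline_ms, candidate_ms):
    """Positive result means the candidate is slower than the baseline."""
    return 100.0 * (candidate_ms - baseline_ms) / baseline_ms


# Hypothetical stand-ins for one forward pass on each build.
def forward_baseline():
    sum(range(10_000))


def forward_patched():
    sum(range(10_000))


if __name__ == "__main__":
    base = median_latency_ms(forward_baseline)
    cand = median_latency_ms(forward_patched)
    print(f"baseline {base:.3f} ms, candidate {cand:.3f} ms, "
          f"change {relative_change_pct(base, cand):+.2f}%")
```

With MXNet itself, which executes operators asynchronously, the timed
callable would need to block until the results are ready (for example by
calling mx.nd.waitall()) for the numbers to be meaningful.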

That is to say, the regression caused by commit 83d2c2d0e should now be
addressed. Please let me know if there are any further issues.

Thank you so much,
Junru


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
I will reach out to you in private, as the model is not public. I think we
should be able to see this problem in a public model using LSTM.



Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Junru Shao <ju...@gmail.com>.
Hi Pedro,

Thanks for bringing this up!

Could you provide your model so that we can dig into this?

Thanks,
Junru

On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <pe...@gmail.com>
wrote:

> I have isolated some of the commits that are causing performance
> regressions in wavenet like models:
>
> Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
> (#14192)
>
> Causes a regression making hybridize with static slower using GPU
> inference.
>
> [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
> int64 as tensor size (#14570)
>
> Causes overall regressions in CPU inference.
>
>
> Pedro.
>
> On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <ro...@gmail.com> wrote:
> >
> > Hi @dev,
> >
> > I am canceling the vote as the issue Lin discovered requires a fix [1] and
> > the solution is not ready yet.
> > It's a general problem when building from source with MXNet, not only
> > impacting horovod use cases.  Any help is appreciated.
> >
> > Other issues we are tracking:
> > 1. Regression on hybridize with static_alloc. (not a blocker for now)
> > 2. Scala doc issue [2], already merged in master, need to backport to
> > 1.5.x
> >
> > Thanks for everyone's help! Please let us know if there is any other
> > issue with 1.5.0
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > [2] https://github.com/apache/incubator-mxnet/pull/15216
> >
> >
> >
> > Best Regards
> >
> > Lai
> >
> >
> > On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy
> > <pedro.larroy.lists@gmail.com> wrote:
> >
> > > Tested with CPU, 2.6x slower. comparing master vs 1.4.1.
> > >
> > > Looks like a general regression.
> > >
> > >
> > > On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <ro...@gmail.com> wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > Thanks for the updates. Currently, we are able to confirm Lin's issue
> > > > with Horovod, and there is a fix pending. [1]
> > > > Will update later today to see if we need to cancel this vote for the
> > > > fix.
> > > >
> > > > As for the hybridize with static alloc performance regression. IMO it
> > > > does not need to be a blocker if we have the following speed order.
> > > > 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
> > > > and it will be great to know the following to better make a decision
> > > > on whether this should block the release.
> > > > 1) if this is a model specific or a general regression.
> > > > 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
> > > > MKLDNN)
> > > >
> > > >
> > > > [1]https://github.com/apache/incubator-mxnet/pull/15213
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Best Regards
> > > >
> > > > Lai
> > > >
> > > >
> > > > On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zh...@apache.org> wrote:
> > > >
> > > > >
> > > > >
> > > > > On 2019/06/11 18:53:56, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > wrote:
> > > > > > The stack trace doesn't seem to come from MXNet, do you have more
> > > > > > info?
> > > > > >
> > > > > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zhreshold@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 2019/06/11 17:36:09, Pedro Larroy
> > > > > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > > > > > A bit more background into this:
> > > > > > > >
> > > > > > > > While tuning a model using LSTM and convolutions we find that
> > > > > > > > using hybridize with static_alloc and static_shape is 15% slower
> > > > > > > > in the latest revision vs in version 1.4.1 in which using
> > > > > > > > hybridize with static_alloc and static_shape is 10% faster than
> > > > > > > > without.
> > > > > > > >
> > > > > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > > > > >
> > > > > > > > Let me know if you think this is a release blocker or not.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > > > > <pe...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > -1
> > > > > > > > >
> > > > > > > > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > > > > > > > affects Hybrid forward, which we are looking into.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <apeforest@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > -1 (Tentatively until resolved)
> > > > > > > > > > >
> > > > > > > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > > > > > > > > > the following error:
> > > > > > > > > >
> > > > > > > > > > Reproduce:
> > > > > > > > > > 1) cp make/config.mk .
> > > > > > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > > > > > 3) make -j
> > > > > > > > > >
> > > > > > > > > > MXNet can build successfully.
> > > > > > > > > >
> > > > > > > > > > 4) pip install horovod
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > > > > > > >     compilation terminated.
> > > > > > > > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > > > > > > > >
> > > > > > > > > > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > > > > > > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Lin
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <yajiedesign@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > Lai Wei <ro...@gmail.com> wrote on Sun, Jun 9, 2019 at 4:12 AM:
> > > > > > > > > > > >
> > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> > > > > > > > > > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > > > > > > > > > > 23:59:59.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > > Lai
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > -1. Built from source, import mxnet in python cause Segfault.
> > > > > > >
> > > > > > > back trace:
> > > > > > >
> > > > > > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > > > > > 0x00007fff3e8a9f20 in ?? ()
> > > > > > > (gdb) bt
> > > > > > > #0  0x00007fff3e8a9f20 in ?? ()
> > > > > > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > std::allocator<char> > const&, bool const&, unsigned int
> const&) ()
> > > > > from
> > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > std::allocator<char> > const&, bool const&, unsigned int
> const&) ()
> > > > > from
> > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > #4  0x00007ffff29d5c48 in ?? () from
> > > /usr/lib/python3/dist-packages/
> > > > > > > apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > > > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > > > > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > > > > #7  0x000000000053fc97 in ?? ()
> > > > > > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > > > > > #9  0x000000000054a328 in ?? ()
> > > > > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > > > > #12 0x000000000053fc97 in ?? ()
> > > > > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > > > > #18 0x00000000004ec2e3 in ?? ()
> > > > > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > > > > >
> > > > > > > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with USE_CUDA=1,
> > > > > > > > USE_CUDNN=1, the rest are default values.
> > > > > > >
> > > > > > > -Zhi
> > > > > >
> > > > >
> > > > > Change to +1, I figured out that it was due to the dependencies. I still
> > > > > have issue using DL base AMI with python3, but I will not regard it as a
> > > > > blocker to 1.5 release.
> > > > > Tested Gluon-CV training and works fine.
> > > > >
> > > > > -Zhi
> > > > >
> > >
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
I have isolated some of the commits that are causing performance
regressions in WaveNet-like models:

[83d2c2d0e] [MXNET-1324] Add NaiveRunGraph to imperative utils (#14192)

Causes a regression that makes hybridize with the static options
(static_alloc, static_shape) slower for GPU inference.

[0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
int64 as tensor size (#14570)

Causes overall regressions in CPU inference.
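
Regressions like these can be compared across revisions with a small
timing harness. The following is only a sketch in plain Python and is not
the benchmark used in this thread; time_callable and slowdown are
hypothetical helper names. Because MXNet executes operators asynchronously,
the timed callable is assumed to block until its result is ready.

```python
import statistics
import time


def time_callable(fn, warmup=3, iters=20):
    """Return the median wall-clock seconds per call of fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are discarded (graph building, caches, allocation)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(baseline_seconds, candidate_seconds):
    """Ratio above 1.0 means the candidate is slower than the baseline."""
    return candidate_seconds / baseline_seconds


# Hypothetical usage with a Gluon HybridBlock `net` that returns a single
# NDArray for input `x` (assumed names, not run here):
#   net.hybridize(static_alloc=True, static_shape=True)
#   t_static = time_callable(lambda: net(x).wait_to_read())
```

Running the same script against each build (for example 1.4.1 and master)
and passing the two medians to slowdown gives a ratio that can be compared
per commit.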


Pedro.

On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <ro...@gmail.com> wrote:
>
> Hi @dev,
>
> I am canceling the vote as the issue Lin discovered require a fix[1] and
> the solution is not ready yet.
> It's a general problem when building from source with MXNet, not only
> impacting horovod use cases.  Any help is appreciated.
>
> Other issues we are tracking:
> 1. Regression on hybridize with static_alloc. (not a blocker for now)
> 2. Scala doc issue [2], already merged in master, need to backport to 1.5.x
>
> Thanks for everyone's help! Please let us know if there is any other issue
> with 1.5.0
>
> [1] https://github.com/apache/incubator-mxnet/pull/15213
> [2] https://github.com/apache/incubator-mxnet/pull/15216
>
>
>
> Best Regards
>
> Lai
>
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Lai Wei <ro...@gmail.com>.
Hi @dev,

I am canceling the vote because the issue Lin discovered requires a fix [1],
and the solution is not ready yet.
It is a general problem when building MXNet from source, and it does not
only affect Horovod use cases. Any help is appreciated.
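
For anyone who wants to check a source build for this symptom before
installing a plugin, a quick sketch is shown below. It only tests whether
the expected header files exist; missing_headers is a hypothetical helper,
and the assumption that mkldnn_version.h should sit next to mkldnn.h under
include/mkldnn is taken from the error message, not from the fix in [1].

```python
from pathlib import Path


def missing_headers(include_dir, relative_names):
    """Return the entries of relative_names that are not files under include_dir."""
    root = Path(include_dir)
    return [name for name in relative_names if not (root / name).is_file()]


# Hypothetical usage; the include path depends on where MXNet was checked out:
#   missing_headers("incubator-mxnet/include",
#                   ["mkldnn/mkldnn.h", "mkldnn/mkldnn_version.h"])
```

An empty list means every listed header was found; any returned name is a
header the plugin build would fail to locate at that path.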

Other issues we are tracking:
1. Regression on hybridize with static_alloc (not a blocker for now).
2. Scala doc issue [2]: already merged in master, needs to be backported to 1.5.x.

Thanks for everyone's help! Please let us know if there is any other issue
with 1.5.0.

[1] https://github.com/apache/incubator-mxnet/pull/15213
[2] https://github.com/apache/incubator-mxnet/pull/15216



Best Regards

Lai


On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Tested with CPU, 2.6x slower. comparing master vs 1.4.1.
>
> Looks like a general regression.
>
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
Tested with CPU: 2.6x slower when comparing master vs 1.4.1.

Looks like a general regression.


On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <ro...@gmail.com> wrote:
>
> Hi guys,
>
> Thanks for the updates. Currently, we are able to confirm Lin's issue with
> Horovod, and there is a fix pending. [1]
> Will update later today to see if we need to cancel this vote for the fix.
>
> As for the hybridize with static alloc performance regression. IMO it does
> not need to be a blocker if we have the following speed order.
> 1.5.0 w/o static > 1.5.0 w/ static  > 1.4.1 w/ static > 1.4.1 w/o static
> and it will be great to know the following to better make a decision on
> whether this should block the release.
> 1) if this is a model specific or a general regression.
> 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
> MKLDNN)
>
>
> [1]https://github.com/apache/incubator-mxnet/pull/15213
>
>
> Thanks
>
> Best Regards
>
> Lai
>
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Lai Wei <ro...@gmail.com>.
Hi guys,

Thanks for the updates. Currently, we are able to confirm Lin's issue with
Horovod, and there is a fix pending [1].
I will update later today on whether we need to cancel this vote for the fix.

As for the hybridize with static_alloc performance regression, IMO it does
not need to be a blocker if we have the following speed order:
1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
It would also be great to know the following, to better decide
whether this should block the release:
1) whether this is a model-specific or a general regression.
2) whether this is platform-specific or general (w/ or w/o CUDA, w/ or w/o
MKLDNN).
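
The speed-order criterion above can be checked mechanically once the four
configurations have been measured. The sketch below assumes throughput
numbers where higher is faster; ordering_holds is a hypothetical helper and
the values in placeholder are made up for illustration, not measurements
from this thread.

```python
def ordering_holds(throughput, order):
    """True if throughput strictly decreases along `order` (fastest first)."""
    values = [throughput[name] for name in order]
    return all(a > b for a, b in zip(values, values[1:]))


# The order proposed above, fastest configuration first.
SPEED_ORDER = [
    "1.5.0 w/o static",
    "1.5.0 w/ static",
    "1.4.1 w/ static",
    "1.4.1 w/o static",
]

# Placeholder throughputs (samples/sec), NOT real measurements.
placeholder = {
    "1.5.0 w/o static": 4.0,
    "1.5.0 w/ static": 3.0,
    "1.4.1 w/ static": 2.0,
    "1.4.1 w/o static": 1.0,
}

print(ordering_holds(placeholder, SPEED_ORDER))
```

Repeating the check per model and per platform (with or without CUDA, with
or without MKLDNN) would address questions 1) and 2).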


[1] https://github.com/apache/incubator-mxnet/pull/15213


Thanks

Best Regards

Lai


On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zh...@apache.org> wrote:

>
>
> On 2019/06/11 18:53:56, Pedro Larroy <pe...@gmail.com>
> wrote:
> > The stack trace doesn't seem to come from MXNet, do you have more info?
> >
> > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zh...@apache.org> wrote:
> > >
> > >
> > >
> > > On 2019/06/11 17:36:09, Pedro Larroy <pe...@gmail.com>
> wrote:
> > > > A bit more background into this:
> > > >
> > > > While tuning a model using LSTM and convolutions we find that using
> > > > hybridize with static_alloc and static_shape is 15% slower in the
> > > > latest revision vs in version 1.4.1 in which using hybridize with
> > > > static_alloc and static_shape is 10% faster than without.
> > > >
> > > > Overall we are still 33% faster when comparing master to 1.5.
> > > >
> > > > Let me know if you think this is a release blocker or not.
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > <pe...@gmail.com> wrote:
> > > > >
> > > > > -1
> > > > >
> > > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > > affects Hybrid forward, which we are looking into.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com>
> wrote:
> > > > > >
> > > > > > -1 (Tentatively until resolved)
> > > > > >
> > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod
> but got
> > > > > > the following error:
> > > > > >
> > > > > > Reproduce:
> > > > > > 1) cp make/config.mk .
> > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > 3) make -j
> > > > > >
> > > > > > MXNet can build successfully.
> > > > > >
> > > > > > 4) pip install horovod
> > > > > >
> > > > > >
> > > > > >
> /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > >     compilation terminated.
> > > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > > >
> > > > > > I did not change any setting of MKLDNN in my config.mk. I am
> building on
> > > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Lin
> > > > > >
> > > > > >
> > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com>
> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Dear MXNet community,
> > > > > > > >
> > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> version
> > > > > > > 1.5.0.
> > > > > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close
> on June 11,
> > > > > > > > 23:59:59.
> > > > > > > >
> > > > > > > > 1) Link to release notes:
> > > > > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > >
> > > > > > > > 2) Link to release candidate:
> > > > > > > >
> > > > > > > >
> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > >
> > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > >
> > > > > > > >
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > >
> > > > > > > >
> > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > +1 = approve
> > > > > > > > +0 = no opinion
> > > > > > > > -1 = disapprove (provide reason)
> > > > > > > >
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > >
> > > > > > > > Lai
> > > > > > > >
> > > > > > >
> > > >
> > >
> > > -1. Built from source, import mxnet in python cause Segfault.
> > >
> > > back trace:
> > >
> > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > 0x00007fff3e8a9f20 in ?? ()
> > > (gdb) bt
> > > #0  0x00007fff3e8a9f20 in ?? ()
> > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> from
> > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> from
> > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/
> > > apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > #7  0x000000000053fc97 in ?? ()
> > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > #9  0x000000000054a328 in ?? ()
> > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > #12 0x000000000053fc97 in ?? ()
> > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > #18 0x00000000004ec2e3 in ?? ()
> > > #19 0x00000000005c20e7 in PyObject_Call ()
> > >
> > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with
> USE_CUDA=1,
> > > USE_CUDNN=1, the rest are default values.
> > >
> > > -Zhi
> >
>
> Change to +1, I figured out that it was due to the dependencies. I still
> have issue using DL base AMI with python3, but I will not regard it as a
> blocker to 1.5 release.
> Tested Gluon-CV training and works fine.
>
> -Zhi
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Zhi Zhang <zh...@apache.org>.

On 2019/06/11 18:53:56, Pedro Larroy <pe...@gmail.com> wrote: 
> The stack trace doesn't seem to come from MXNet, do you have more info?
> 
> On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zh...@apache.org> wrote:
> >
> >
> >
> > On 2019/06/11 17:36:09, Pedro Larroy <pe...@gmail.com> wrote:
> > > A bit more background into this:
> > >
> > > While tuning a model using LSTM and convolutions we find that using
> > > hybridize with static_alloc and static_shape is 15% slower in the
> > > latest revision vs in version 1.4.1 in which using hybridize with
> > > static_alloc and static_shape is 10% faster than without.
> > >
> > > Overall we are still 33% faster when comparing master to 1.5.
> > >
> > > Let me know if you think this is a release blocker or not.
> > >
> > > Pedro.
> > >
> > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > <pe...@gmail.com> wrote:
> > > >
> > > > -1
> > > >
> > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > affects Hybrid forward, which we are looking into.
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
> > > > >
> > > > > -1 (Tentatively until resolved)
> > > > >
> > > > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > > > the following error:
> > > > >
> > > > > Reproduce:
> > > > > 1) cp make/config.mk .
> > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > 3) make -j
> > > > >
> > > > > MXNet can build successfully.
> > > > >
> > > > > 4) pip install horovod
> > > > >
> > > > >
> > > > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > >     compilation terminated.
> > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > >
> > > > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > > >
> > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > > > >
> > > > > > > Dear MXNet community,
> > > > > > >
> > > > > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > > > 1.5.0.
> > > > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > > > > 23:59:59.
> > > > > > >
> > > > > > > 1) Link to release notes:
> > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > >
> > > > > > > 2) Link to release candidate:
> > > > > > >
> > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > >
> > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > >
> > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > >
> > > > > > >
> > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > +1 = approve
> > > > > > > +0 = no opinion
> > > > > > > -1 = disapprove (provide reason)
> > > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > >
> > > > > > > Lai
> > > > > > >
> > > > > >
> > >
> >
> > -1. Built from source, import mxnet in python cause Segfault.
> >
> > back trace:
> >
> > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > 0x00007fff3e8a9f20 in ?? ()
> > (gdb) bt
> > #0  0x00007fff3e8a9f20 in ?? ()
> > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&, bool const&, unsigned int const&) () from
> > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&, bool const&, unsigned int const&) () from
> > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/
> > apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > #5  0x00000000004ea10f in PyCFunction_Call ()
> > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > #7  0x000000000053fc97 in ?? ()
> > #8  0x00000000005409bf in PyEval_EvalCode ()
> > #9  0x000000000054a328 in ?? ()
> > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > #12 0x000000000053fc97 in ?? ()
> > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > #18 0x00000000004ec2e3 in ?? ()
> > #19 0x00000000005c20e7 in PyObject_Call ()
> >
> > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with USE_CUDA=1,
> > USE_CUDNN=1, the rest are default values.
> >
> > -Zhi
> 

Changing my vote to +1: I figured out that the segfault was due to the
dependencies. I still have an issue using the DL Base AMI with python3, but
I do not regard it as a blocker for the 1.5 release.
Tested Gluon-CV training and it works fine.

-Zhi

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
The stack trace doesn't seem to come from MXNet, do you have more info?

On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zh...@apache.org> wrote:
>
>
>
> On 2019/06/11 17:36:09, Pedro Larroy <pe...@gmail.com> wrote:
> > A bit more background into this:
> >
> > While tuning a model using LSTM and convolutions we find that using
> > hybridize with static_alloc and static_shape is 15% slower in the
> > latest revision vs in version 1.4.1 in which using hybridize with
> > static_alloc and static_shape is 10% faster than without.
> >
> > Overall we are still 33% faster when comparing master to 1.5.
> >
> > Let me know if you think this is a release blocker or not.
> >
> > Pedro.
> >
> > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > <pe...@gmail.com> wrote:
> > >
> > > -1
> > >
> > > We found a performance regression vs 1.4 related to CachedOp which
> > > affects Hybrid forward, which we are looking into.
> > >
> > > Pedro.
> > >
> > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
> > > >
> > > > -1 (Tentatively until resolved)
> > > >
> > > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > > the following error:
> > > >
> > > > Reproduce:
> > > > 1) cp make/config.mk .
> > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > 3) make -j
> > > >
> > > > MXNet can build successfully.
> > > >
> > > > 4) pip install horovod
> > > >
> > > >
> > > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > fatal error: mkldnn_version.h: No such file or directory
> > > >     compilation terminated.
> > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > >
> > > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > >
> > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > > >
> > > > > > Dear MXNet community,
> > > > > >
> > > > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > > 1.5.0.
> > > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > > > 23:59:59.
> > > > > >
> > > > > > 1) Link to release notes:
> > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > >
> > > > > > 2) Link to release candidate:
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > >
> > > > > > 3) Link to source and signatures on apache dist server:
> > > > > >
> > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > >
> > > > > >
> > > > > > Please remember to TEST first before voting accordingly:
> > > > > > +1 = approve
> > > > > > +0 = no opinion
> > > > > > -1 = disapprove (provide reason)
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Lai
> > > > > >
> > > > >
> >
>
> -1. Built from source, import mxnet in python cause Segfault.
>
> back trace:
>
> Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> 0x00007fff3e8a9f20 in ?? ()
> (gdb) bt
> #0  0x00007fff3e8a9f20 in ?? ()
> #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, bool const&, unsigned int const&) () from
> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, bool const&, unsigned int const&) () from
> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/
> apt_pkg.cpython-35m-x86_64-linux-gnu.so
> #5  0x00000000004ea10f in PyCFunction_Call ()
> #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> #7  0x000000000053fc97 in ?? ()
> #8  0x00000000005409bf in PyEval_EvalCode ()
> #9  0x000000000054a328 in ?? ()
> #10 0x00000000004ea1c6 in PyCFunction_Call ()
> #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> #12 0x000000000053fc97 in ?? ()
> #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> #18 0x00000000004ec2e3 in ?? ()
> #19 0x00000000005c20e7 in PyObject_Call ()
>
> I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with USE_CUDA=1,
> USE_CUDNN=1, the rest are default values.
>
> -Zhi

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Zhi Zhang <zh...@apache.org>.

On 2019/06/11 17:36:09, Pedro Larroy <pe...@gmail.com> wrote: 
> A bit more background into this:
> 
> While tuning a model using LSTM and convolutions we find that using
> hybridize with static_alloc and static_shape is 15% slower in the
> latest revision vs in version 1.4.1 in which using hybridize with
> static_alloc and static_shape is 10% faster than without.
> 
> Overall we are still 33% faster when comparing master to 1.5.
> 
> Let me know if you think this is a release blocker or not.
> 
> Pedro.
> 
> On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> <pe...@gmail.com> wrote:
> >
> > -1
> >
> > We found a performance regression vs 1.4 related to CachedOp which
> > affects Hybrid forward, which we are looking into.
> >
> > Pedro.
> >
> > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
> > >
> > > -1 (Tentatively until resolved)
> > >
> > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > the following error:
> > >
> > > Reproduce:
> > > 1) cp make/config.mk .
> > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > 3) make -j
> > >
> > > MXNet can build successfully.
> > >
> > > 4) pip install horovod
> > >
> > >
> > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > fatal error: mkldnn_version.h: No such file or directory
> > >     compilation terminated.
> > >     INFO: Unable to build MXNet plugin, will skip it.
> > >
> > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> > >
> > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > 1.5.0.
> > > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > > 23:59:59.
> > > > >
> > > > > 1) Link to release notes:
> > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > >
> > > > > 2) Link to release candidate:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > >
> > > > > 3) Link to source and signatures on apache dist server:
> > > > >
> > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > >
> > > > >
> > > > > Please remember to TEST first before voting accordingly:
> > > > > +1 = approve
> > > > > +0 = no opinion
> > > > > -1 = disapprove (provide reason)
> > > > >
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > >
> 

-1. Built from source; import mxnet in Python causes a segfault.

back trace:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fff3e8a9f20 in ?? ()
(gdb) bt
#0  0x00007fff3e8a9f20 in ?? ()
#1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, bool const&, unsigned int const&) () from
/usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
#2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, bool const&, unsigned int const&) () from
/usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
#3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
/usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
#4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/
apt_pkg.cpython-35m-x86_64-linux-gnu.so
#5  0x00000000004ea10f in PyCFunction_Call ()
#6  0x0000000000536d94 in PyEval_EvalFrameEx ()
#7  0x000000000053fc97 in ?? ()
#8  0x00000000005409bf in PyEval_EvalCode ()
#9  0x000000000054a328 in ?? ()
#10 0x00000000004ea1c6 in PyCFunction_Call ()
#11 0x000000000053d353 in PyEval_EvalFrameEx ()
#12 0x000000000053fc97 in ?? ()
#13 0x000000000053bc93 in PyEval_EvalFrameEx ()
#14 0x000000000053b294 in PyEval_EvalFrameEx ()
#15 0x000000000053b294 in PyEval_EvalFrameEx ()
#16 0x000000000053b294 in PyEval_EvalFrameEx ()
#17 0x0000000000540b0b in PyEval_EvalCodeEx ()
#18 0x00000000004ec2e3 in ?? ()
#19 0x00000000005c20e7 in PyObject_Call ()

I was using a fresh DLAMI ubuntu 18.0 with CUDA 10.0, built with USE_CUDA=1
and USE_CUDNN=1; the rest are default values.
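
The frames in the trace above pass through apt_pkg and libapt-pkg, so one way
to narrow this down is to import each suspect module in a separate child
interpreter and see which one dies. A minimal sketch in Python; probe_import
is a hypothetical helper written for this note, not part of MXNet, and the
signal interpretation assumes a POSIX system:

```python
import subprocess
import sys


def probe_import(module):
    """Import `module` in a child interpreter and describe how the child exited."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    code = result.returncode
    if code == 0:
        return "ok"
    if code < 0:
        # On POSIX, a negative return code means the child was killed by a
        # signal; a segfault is typically reported as SIGSEGV (signal 11).
        return f"killed by signal {-code}"
    return f"failed with exit code {code}"


if __name__ == "__main__":
    # Module names are passed on the command line, e.g. apt_pkg mxnet
    for name in sys.argv[1:]:
        print(name, "->", probe_import(name))
```

Because each import runs in its own process, a crash in one module does not
take down the probe itself.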

-Zhi

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
A bit more background on this:

While tuning a model using LSTM and convolutions, we find that using
hybridize with static_alloc and static_shape is 15% slower in the
latest revision, vs. version 1.4.1, in which using hybridize with
static_alloc and static_shape is 10% faster than without.

Overall we are still 33% faster when comparing master to 1.5.

Let me know if you think this is a release blocker or not.

Pedro.

On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
<pe...@gmail.com> wrote:
>
> -1
>
> We found a performance regression vs 1.4 related to CachedOp which
> affects Hybrid forward, which we are looking into.
>
> Pedro.
>
> On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
> >
> > -1 (Tentatively until resolved)
> >
> > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > the following error:
> >
> > Reproduce:
> > 1) cp make/config.mk .
> > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > 3) make -j
> >
> > MXNet can build successfully.
> >
> > 4) pip install horovod
> >
> >
> > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > fatal error: mkldnn_version.h: No such file or directory
> >     compilation terminated.
> >     INFO: Unable to build MXNet plugin, will skip it.
> >
> > I did not change any setting of MKLDNN in my config.mk. I am building on
> > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> >
> > Thanks,
> >
> > Lin
> >
> >
> > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
> >
> > > +1
> > >
> > > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> > >
> > > > Dear MXNet community,
> > > >
> > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > 1.5.0.
> > > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > > 23:59:59.
> > > >
> > > > 1) Link to release notes:
> > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > >
> > > > 2) Link to release candidate:
> > > >
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > >
> > > > 3) Link to source and signatures on apache dist server:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > >
> > > >
> > > > Please remember to TEST first before voting accordingly:
> > > > +1 = approve
> > > > +0 = no opinion
> > > > -1 = disapprove (provide reason)
> > > >
> > > >
> > > > Best Regards
> > > >
> > > > Lai
> > > >
> > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Pedro Larroy <pe...@gmail.com>.
-1

We found a performance regression vs 1.4 related to CachedOp which
affects Hybrid forward, which we are looking into.

Pedro.

On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <ap...@gmail.com> wrote:
>
> -1 (Tentatively until resolved)
>
> I tried to build MXNet 1.5.0 from source and pip install horovod but got
> the following error:
>
> Reproduce:
> 1) cp make/config.mk .
> 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> 3) make -j
>
> MXNet can build successfully.
>
> 4) pip install horovod
>
>
> /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> fatal error: mkldnn_version.h: No such file or directory
>     compilation terminated.
>     INFO: Unable to build MXNet plugin, will skip it.
>
> I did not change any setting of MKLDNN in my config.mk. I am building on
> DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
>
> Thanks,
>
> Lin
>
>
> On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:
>
> > +1
> >
> > On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
> >
> > > Dear MXNet community,
> > >
> > > This is the 3-day vote to release Apache MXNet (incubating) version
> > 1.5.0.
> > > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > > 23:59:59.
> > >
> > > 1) Link to release notes:
> > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > >
> > > 2) Link to release candidate:
> > >
> > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > >
> > > 3) Link to source and signatures on apache dist server:
> > >
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > >
> > >
> > > Please remember to TEST first before voting accordingly:
> > > +1 = approve
> > > +0 = no opinion
> > > -1 = disapprove (provide reason)
> > >
> > >
> > > Best Regards
> > >
> > > Lai
> > >
> >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by Lin Yuan <ap...@gmail.com>.
-1 (Tentatively until resolved)

I tried to build MXNet 1.5.0 from source and pip install horovod but got
the following error:

Reproduce:
1) cp make/config.mk .
2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
3) make -j

MXNet can build successfully.

4) pip install horovod


/home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
fatal error: mkldnn_version.h: No such file or directory
    compilation terminated.
    INFO: Unable to build MXNet plugin, will skip it.

I did not change any MKLDNN settings in my config.mk. I am building on
DLAMI base 18.0, which is Ubuntu 16.04 with CUDA 10.0.
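
Step 2 of the reproduction (turning on the build flags) can be scripted
instead of edited by hand. A minimal sketch in Python, assuming config.mk
uses plain "NAME = value" assignments; enable_flags is a hypothetical helper
and the sample text is made up for illustration, not copied from the real
file:

```python
import re


def enable_flags(config_text, flags):
    """Return config_text with each `FLAG = value` assignment set to 1."""
    for flag in flags:
        # Match only the exact flag name at the start of a line, so that
        # e.g. USE_CUDA does not also rewrite USE_CUDA_PATH.
        pattern = re.compile(
            rf"^([ \t]*{re.escape(flag)}[ \t]*=[ \t]*).*$", re.MULTILINE
        )
        config_text, count = pattern.subn(r"\g<1>1", config_text)
        if count == 0:
            raise KeyError(f"{flag} not found in config text")
    return config_text


# Made-up sample in the assumed format.
sample = "USE_CUDA = 0\nUSE_CUDA_PATH = NONE\nUSE_CUDNN = 0\nUSE_NCCL = 0\n"

if __name__ == "__main__":
    print(enable_flags(sample, ["USE_CUDA", "USE_CUDNN", "USE_NCCL"]))
```

The rewritten text would then be saved as config.mk before running make -j.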

Thanks,

Lin


On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <ya...@gmail.com> wrote:

> +1
>
> On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:
>
> > Dear MXNet community,
> >
> > This is the 3-day vote to release Apache MXNet (incubating) version
> 1.5.0.
> > Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> > 23:59:59.
> >
> > 1) Link to release notes:
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> >
> > 2) Link to release candidate:
> >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> >
> > 3) Link to source and signatures on apache dist server:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> >
> >
> > Please remember to TEST first before voting accordingly:
> > +1 = approve
> > +0 = no opinion
> > -1 = disapprove (provide reason)
> >
> >
> > Best Regards
> >
> > Lai
> >
>

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Posted by shiwen hu <ya...@gmail.com>.
+1

On Sun, Jun 9, 2019 at 4:12 AM Lai Wei <ro...@gmail.com> wrote:

> Dear MXNet community,
>
> This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> Voting on dev@ will start June 8, 23:59:59(PST)  and close on June 11,
> 23:59:59.
>
> 1) Link to release notes:
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
>
> 2) Link to release candidate:
>
> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
>
> 3) Link to source and signatures on apache dist server:
>
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
>
>
> Please remember to TEST first before voting accordingly:
> +1 = approve
> +0 = no opinion
> -1 = disapprove (provide reason)
>
>
> Best Regards
>
> Lai
>