Posted to dev@mxnet.apache.org by Pedro Larroy <pe...@gmail.com> on 2018/05/02 18:50:22 UTC

segmentation fault in master using mkdlnn

Hi

It seems master is no longer running; there's a segmentation fault using
MKLDNN-CPU:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662


I see my PRs failing with a similar error.

Pedro

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Bhavin

Good suggestion

I tried 1), but I couldn't get a core dump inside the container, even with
ulimit -c unlimited. I found out that /proc/sys/kernel/core_pattern on
Ubuntu defaults to a pipe to /usr/share/apport/apport, which doesn't exist
inside the container.

Changing it outside the container with
echo 'core.%h.%e.%t' > /proc/sys/kernel/core_pattern
solves this mystery, so I now have a core dump, which I added to the ticket.
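
For anyone repeating this, the check can be sketched in shell. This is only an illustration under stated assumptions, not the exact commands used for the ticket: the function name core_pattern_is_piped is made up here, and the sketch relies only on the kernel convention that a core_pattern value starting with '|' pipes the dump to a helper program.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MXNet CI scripts): succeeds when a
# core_pattern value pipes core dumps to a helper program such as apport.
core_pattern_is_piped() {
    case "$1" in
        '|'*) return 0 ;;
        *)    return 1 ;;
    esac
}

pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    pattern=$(cat "$pattern_file")
    if core_pattern_is_piped "$pattern"; then
        echo "core dumps are piped to a helper: $pattern"
        echo "if that helper is missing in the container, no core file appears"
    else
        echo "core dumps are written to files named: $pattern"
    fi
fi
```

If the pattern turns out to be piped, the change itself still has to be made on the host as root, because the container shares the host kernel's setting.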

Trying to get to the bottom of the issue :-)



On Thu, May 3, 2018 at 4:02 PM, Bhavin Thaker <bh...@gmail.com>
wrote:

> Hi Pedro, All,
>
> 1) I would suggest that we run “ulimit -c unlimited” in every CI Slave
> machine at startup to enable core-dump and get stack trace.
>
> 2) Valgrind on Python generates so much noise that extracting signal from
> it is painful, but it is still worth trying it out and looking at the
> messages towards the end, when the crash happens. Even Valgrind on a
> one-liner Python script generates noise, which demonstrates that Python
> itself is not Valgrind-clean.
>
> 3) If there are C++ APIs to trigger the same functionality as the current
> problematic use-case, then one could write a small program to reproduce the
> crash and then use Valgrind to get to the culprit portion of the code
> quickly.
>
> Bhavin Thaker.
>
> On Thu, May 3, 2018 at 6:49 AM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > It's very difficult to reproduce, non-deterministic. We were also running
> > without signal handlers in CI so there are no stack traces unfortunately.
> >
> > Care to elaborate why valgrind doesn't work with Python?
> >
> >
> >
> > On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zh...@gmail.com> wrote:
> >
> > > Can we build it in CI? Segfaults don't happen infrequently.
> > >
> > > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cj...@gmail.com> wrote:
> > >
> > > > You can try Intel Inspector, which is like an enhanced version of
> > > > valgrind with a GUI and whatnot.
> > > >
> > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zh...@gmail.com> wrote:
> > > >
> > > > > valgrind doesn't work with Python. Also, valgrind doesn't support
> > > > > some CPU instructions used by MXNet (I think some instructions
> > > > > related to the random generator).
> > > > >
> > > > >
> > > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > > > wrote:
> > > > > > > Have you tried running with valgrind to get some clues on the
> > > > > > > root-cause?
> > > > > > >
> > > > > > > Bhavin Thaker.
> > > > > > >
> > > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zh...@gmail.com> wrote:
> > > > > > >
> > > > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > > > >> I just saw a similar memory error without MKLDNN build.
> > > > > > >>
> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > >>
> > > > > >> Best,
> > > > > >> Da
> > > > > >>
> > > > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dz...@amazon.com> wrote:
> > > > > > >> > There might be a race condition that causes the memory error.
> > > > > > >> > It might be caused by this PR:
> > > > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > > > >> > However, I don't know why this causes a memory error. If someone
> > > > > > >> > is using the memory, it should still hold the memory with a
> > > > > > >> > shared pointer.
> > > > > > >> > But I do see the memory errors increase after this PR was merged.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Da
> > > > > >> >
> > > > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> >     I couldn't reproduce locally with:
> > > > > > >> >
> > > > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn \
> > > > > > >> >       && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > > >> >
> > > > > >> >
> > > > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > >> >     wrote:
> > > > > > >> >
> > > > > > >> >     > Hi
> > > > > > >> >     >
> > > > > > >> >     > Seems master is not running anymore, there's a segmentation fault
> > > > > > >> >     > using MKLDNN-CPU
> > > > > > >> >     >
> > > > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > >> >     >
> > > > > >> >     >
> > > > > >> >     > I see my PRs failing with a similar error.
> > > > > >> >     >
> > > > > >> >     > Pedro
> > > > > >> >     >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: segmentation fault in master using mkdlnn

Posted by Bhavin Thaker <bh...@gmail.com>.

Hi Pedro, All,

1) I would suggest that we run “ulimit -c unlimited” in every CI Slave
machine at startup to enable core-dump and get stack trace.

2) Valgrind on Python generates so much noise that extracting signal from
it is painful, but it is still worth trying it out and looking at the
messages towards the end, when the crash happens. Even Valgrind on a
one-liner Python script generates noise, which demonstrates that Python
itself is not Valgrind-clean.

3) If there are C++ APIs to trigger the same functionality as the current
problematic use-case, then one could write a small program to reproduce the
crash and then use Valgrind to get to the culprit portion of the code
quickly.
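
As a rough sketch of suggestion 1), a test command can be wrapped so that the core-size limit is raised first and a crash is reported explicitly. The wrapper name run_with_core_dumps is hypothetical and not an existing CI script. It relies on the usual shell convention that an exit status above 128 means the command was killed by signal (status - 128), so a segmentation fault normally shows up as 139.

```shell
#!/bin/sh
# Hypothetical wrapper (not an existing CI script): run a command with the
# core-size limit raised and report whether it was killed by a signal.
run_with_core_dumps() {
    (
        # Raising the soft limit fails if the hard limit is lower; warn and go on.
        ulimit -c unlimited 2>/dev/null \
            || echo "warning: could not raise the core-size limit" >&2
        "$@"
    )
    status=$?
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
    return "$status"
}
```

It could be used as, for example, run_with_core_dumps nosetests-2.7 --verbose tests/python/unittest. Whether a core file is actually written still depends on the core_pattern configured on the host.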

Bhavin Thaker.

On Thu, May 3, 2018 at 6:49 AM Pedro Larroy <pe...@gmail.com>
wrote:

> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
>
> Care to elaborate why valgrind doesn't work with Python?
>

Re: segmentation fault in master using mkdlnn

Posted by "Shaikh, Eftiquar" <ef...@amazon.com>.

If the issue is platform-neutral, I can try reproducing it on Windows. A fault in native code should produce a dump that can be analyzed.
I am currently working on building mxnet from source and can spend some time on this.

> On May 3, 2018, at 6:51 AM, Pedro Larroy <pe...@gmail.com> wrote:
> 
> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
> 
> Care to elaborate why valgrind doesn't work with Python?
> 

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

I tried to compile with MKLDNN using CMake + CLion and ran into some
difficulties, even though I have mkldnn in the 3rdparty folder and
MKL installed in /usr/local.

What exactly are the steps to compile with MKLDNN using CMake? I saw this
documented only for Make.

Pedro.

On Thu, May 3, 2018 at 4:59 PM, Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Da
>
> Reproduction instructions:
>
> On the host:
>
> Adjust core pattern:
>
> $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
> Use the following patch:
>
> ===============
>
> diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> --- a/3rdparty/mkldnn
> +++ b/3rdparty/mkldnn
> @@ -1 +1 @@
> -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
> index 027e287..62649c9 100755
> --- a/ci/docker/runtime_functions.sh
> +++ b/ci/docker/runtime_functions.sh
> @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>      # https://github.com/apache/incubator-mxnet/issues/10026
>      #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>      export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> -    nosetests-2.7 --verbose tests/python/unittest
> -    nosetests-2.7 --verbose tests/python/train
> -    nosetests-2.7 --verbose tests/python/quantization
> +    export MXNET_TEST_SEED=11
> +    export MXNET_MODULE_SEED=812478194
> +    pwd
> +    export MXNET_TEST_COUNT=10000
> +    ulimit -c unlimited
> +    ulimit -c
> +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
> +    #nosetests-2.7 --verbose tests/python/train
> +    #nosetests-2.7 --verbose tests/python/quantization
>  }
>
>  unittest_ubuntu_python3_cpu() {
>
>
>
> ==============
>
> Build and execute the test, make sure the repo is clean
>
> $ ci/docker/runtime_functions.sh clean_repo
>
> $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn \
>     && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
> Once it crashes it will stop.
>
> Then go in the container:
>
>
> $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>
> A core should be there.
>
> You might need to install gdb as root, by running the previous command
> without the uid, so that you can use apt-get.
>
>
>
>
> Good luck.
>
>
>
>
>
>
>
> On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
>
>> Thanks a lot for locating the error.
>> Could you tell me how you reproduced the error?
>>
>> On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
>>
>>     Looks like a problem in mkl's same_shape
>>
>>     the pointer to mkldnn::memory::desc &desc  looks invalid.
>>
>>     (More stack frames follow...)
>>     (gdb) p desc
>>     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>>     (gdb) p dtype
>>     $2 = 0
>>     (gdb) p shape
>>     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
>>     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
>>         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
>>     fields>}
>>     (gdb)
>>
>>
>>     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com> wrote:
>>
>>     > There are a few problems with valgrind that make it a poor fit for
>>     > mxnet with the Python interface.
>>     >
>>     > First, valgrind generates a huge number of irrelevant messages, most
>>     > of them from Python itself.
>>     >
>>     > Second, valgrind can't emulate all CPU instructions. I remember that
>>     > when I ran valgrind with mxnet, valgrind exited with a strange error.
>>     > I later found that it was caused by an unsupported CPU instruction.
>>     >
>>     > Third, valgrind doesn't support multithreading well. As far as I know,
>>     > valgrind runs everything in a single thread even if the program uses
>>     > multi-threading. An error like this, which is likely caused by a race
>>     > condition, can't be caught by valgrind.
>>     >
>>     > I used to use Address Sanitizer for memory errors. This tool is much
>>     > faster and can work with multiple threads. However, it doesn't work
>>     > with Python for some reason.
>>     >
>>     > One thing we could potentially do is use a memory checker for the C++
>>     > unit tests. I'm not sure it'll cover all the memory errors we want.
>>     >
>>     > Best,
>>     > Da
>>     >
>>
>>
>>
>
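
The reproduction loop in the quoted patch (while nosetests-2.7 ...; do echo round; done) runs forever if the test never fails. A bounded variant might look like the sketch below. The function name run_until_failure and the round limit are illustrative assumptions, not something taken from the CI scripts.

```shell
#!/bin/sh
# Hypothetical helper: re-run a command until it fails or max_rounds is reached.
# Returns 0 if every round passed, otherwise the failing command's exit status.
run_until_failure() {
    max_rounds=$1
    shift
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "round $round failed with exit status $status"
            return "$status"
        fi
        echo "round $round passed"
        round=$((round + 1))
    done
    return 0
}
```

For the test in the patch this would be called as run_until_failure 10000 nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape. An exit status of 139 would usually indicate the segmentation fault rather than an ordinary test failure.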

Re: segmentation fault in master using mkdlnn

Posted by Da Zheng <zh...@gmail.com>.

I have come up with a temporary solution for this memory error:
https://github.com/apache/incubator-mxnet/pull/10812
I tested it with Anirudh's command, and it works fine.

I call it a temporary solution because it only fixes the segfault. It
seems to me that the race condition can potentially corrupt data in
the input array even without MKLDNN. Please see the description in my
PR for more details.

Best,
Da

On Fri, May 4, 2018 at 12:14 PM, Zheng, Da <dz...@amazon.com> wrote:
> Hello Pedro,
>
> I did exactly what you said in your previous email.
>
> I edited ci/docker/runtime_functions.sh based on your patch, and here is the history of the commands I ran:
>  2004  vim ci/docker/runtime_functions.sh
>  2005  ci/docker/runtime_functions.sh clean_repo
>  2006  ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
> Best,
> Da
>
> On 5/4/18, 4:32 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
>
>     Hi Da. I ran it both on my Ubuntu 16.04 workstation and on a p3 instance
>     with DLAMI. I'm pretty confident it runs in most Linux environments.
>
>     Can you post the exact commands that you ran? It's not clear to me from
>     your paste what the problem is. Please make sure your repo is clean and
>     all your subrepos are clean before starting the docker build.
>
>     ci/docker/runtime_functions.sh clean_repo
>
>     Pedro.
>
>     On Thu, May 3, 2018 at 7:17 PM, Zheng, Da <dz...@amazon.com> wrote:
>
>     > Hello Pedro,
>     >
>     > I tried your instructions. It seems I can't run docker on EC2
>     > instances.
>     > Where did you reproduce the error?
>     >
>     > Thanks,
>     > Da
>     >
>     > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
>     > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
>     > gpg: directory `/root/.gnupg' created
>     > gpg: new configuration file `/root/.gnupg/gpg.conf' created
>     > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
>     > this run
>     > gpg: keyring `/root/.gnupg/secring.gpg' created
>     > gpg: keyring `/root/.gnupg/pubring.gpg' created
>     > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
>     > gpg: keyserver timed out
>     > gpg: keyserver receive failed: keyserver error
>     > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
>     > Traceback (most recent call last):
>     >   File "ci/build.py", line 263, in <module>
>     >     sys.exit(main())
>     >   File "ci/build.py", line 197, in main
>     >     build_docker(platform, docker_binary)
>     >   File "ci/build.py", line 73, in build_docker
>     >     check_call(cmd)
>     >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
>     >     raise CalledProcessError(retcode, cmd)
>     > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
>     > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
>     > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>     >
>     >
>     > On 5/3/18, 8:01 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
>     >
>     >     Hi Da
>     >
>     >     Reproduction instructions:
>     >
>     >     On the host:
>     >
>     >     Adjust core pattern:
>     >
>     >     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>     >
>     >
>     >     Use the following patch:
>     >
>     >     ===============
>     >
>     >     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
>     >     --- a/3rdparty/mkldnn
>     >     +++ b/3rdparty/mkldnn
>     >     @@ -1 +1 @@
>     >     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
>     >     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
>     >     diff --git a/ci/docker/runtime_functions.sh
>     > b/ci/docker/runtime_functions.sh
>     >     index 027e287..62649c9 100755
>     >     --- a/ci/docker/runtime_functions.sh
>     >     +++ b/ci/docker/runtime_functions.sh
>     >     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>     >          # https://github.com/apache/incubator-mxnet/issues/10026
>     >          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>     >          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>     >     -    nosetests-2.7 --verbose tests/python/unittest
>     >     -    nosetests-2.7 --verbose tests/python/train
>     >     -    nosetests-2.7 --verbose tests/python/quantization
>     >     +    export MXNET_TEST_SEED=11
>     >     +    export MXNET_MODULE_SEED=812478194
>     >     +    pwd
>     >     +    export MXNET_TEST_COUNT=10000
>     >     +    ulimit -c unlimited
>     >     +    ulimit -c
>     >     +    while nosetests-2.7 --verbose
>     >     tests/python/unittest/test_module.py:test_forward_reshape; do echo
>     > round;
>     >     done
>     >     +    #nosetests-2.7 --verbose tests/python/train
>     >     +    #nosetests-2.7 --verbose tests/python/quantization
>     >      }
>     >
>     >      unittest_ubuntu_python3_cpu() {
>     >
>     >
>     >
>     >     ==============
>     >
>     >     Build and execute the test, make sure the repo is clean
>     >
>     >     $ ci/docker/runtime_functions.sh clean_repo
>     >
>     >     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>     >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>     >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>     >
>     >
>     >     Once it crashes it will stop.
>     >
>     >     Then go in the container:
>     >
>     >
>     >     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>     >
>     >     A core should be there.
>     >
>     >     you might need to install gdb as root by executing the previous command
>     >     without uid so you can use apt-get.
>     >
>     >
>     >
>     >
>     >     Good luck.
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     >     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
>     >
>     >     > Thanks a lot for locating the error.
>     >     > Could you tell me How you reproduce the error?
>     >     >
>     >     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com>
>     > wrote:
>     >     >
>     >     >     Looks like a problem in mkl's same_shape
>     >     >
>     >     >     the pointer to mkldnn::memory::desc &desc  looks invalid.
>     >     >
>     >     >     (More stack frames follow...)
>     >     >     (gdb) p desc
>     >     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading
>     > variable>
>     >     >     (gdb) p dtype
>     >     >     $2 = 0
>     >     >     (gdb) p shape
>     >     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50:
>     > {<nnvm::Tuple<long>> =
>     >     >     {static kStackCache = <optimized out>, ndim_ = 2,
>     > num_heap_allocated_
>     >     > = 0,
>     >     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ =
>     > 0x0}, <No
>     >     > data
>     >     >     fields>}
>     >     >     (gdb)
>     >     >
>     >     >
>     >     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com>
>     > wrote:
>     >     >
>     >     >     > There are a few problems with valgrind, which makes it not an
>     > ideal
>     >     > tool
>     >     >     > for mxnet with python interface.
>     >     >     >
>     >     >     > First, valgrind generates a huge number of irrelevant
>     > messages, most
>     >     > of
>     >     >     > them from in Python itself.
>     >     >     >
>     >     >     > Second, valgrind can't emulate all CPU instructions. I
>     > remember that
>     >     > when
>     >     >     > I run valgrind with mxnet, valgrind exits with a strange
>     > error. I
>     >     > later on
>     >     >     > found that it was caused by an unsupported CPU instructions.
>     >     >     >
>     >     >     > Third, valgrind doesn't support multithreading well. As far as
>     > I
>     >     > know,
>     >     >     > valgrind runs everything in a single thread even if the
>     > program uses
>     >     >     > multi-threading. An error like this, which is likely caused by
>     > race
>     >     >     > condition, can't be caught by valgrind.
>     >     >     >
>     >     >     > I used to use Address Sanitizer for memory errors. This tool
>     > is much
>     >     >     > faster and can work with multi-threads. However, it doesn't
>     > work with
>     >     >     > Python for some reason.
>     >     >     >
>     >     >     > One thing we potentially can do is to use memory checker for
>     > C++ unit
>     >     >     > tests. Not sure it'll cover all memory errors we want.
>     >     >     >
>     >     >     > Best,
>     >     >     > Da
>     >     >     >
>     >     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <
>     > pedro.larroy.lists@gmail.com>
>     >     > wrote:
>     >     >     >
>     >     >     >     It's very difficult to reproduce, non-deterministic. We
>     > were also
>     >     >     > running
>     >     >     >     without signal handlers in CI so there are no stack traces
>     >     >     > unfortunately.
>     >     >     >
>     >     >     >     Care to elaborate why valgrind doesn't work with Python?
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <
>     > zhengda1936@gmail.com>
>     >     >     > wrote:
>     >     >     >
>     >     >     >     > can we build it in CI？segfault doesn't happen
>     > infrequently.
>     >     >     >     >
>     >     >     >     > 2018年5月2日 下午11:34，"Chris Olivier" <cjolivier01@gmail.com
>     > >写道：
>     >     >     >     >
>     >     >     >     > > You can try Intel Inspector, which is like an enhanced version
>     >     >     >     > > of valgrind with a GUI and whatnot.
>     >     >     >     > >
>     >     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
>     >     >     >     > >
>     >     >     >     > > > valgrind doesn't work with Python. Also, valgrind doesn't
>     >     >     >     > > > support some CPU instructions used by MXNet (I think some
>     >     >     >     > > > instructions related to the random number generator).
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com> wrote:
>     >     >     >     > > > > Have you tried running with valgrind to get some clues on the
>     >     >     >     > > > > root cause?
>     >     >     >     > > > >
>     >     >     >     > > > > Bhavin Thaker.
>     >     >     >     > > > >
>     >     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
>     >     >     >     > > > >
>     >     >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
>     >     >     >     > > > >> I just saw a similar memory error in a build without MKLDNN.
>     >     >     >     > > > >>
>     >     >     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>     >     >     >     > > > >>
>     >     >     >     > > > >> Best,
>     >     >     >     > > > >> Da
>     >     >     >     > > > >>
>     >     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>     >     >     >     > > > >> > There might be a race condition that causes the memory error.
>     >     >     >     > > > >> > It might be caused by this PR:
>     >     >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>     >     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>     >     >     >     > > > >> > However, I don't know why this causes a memory error. If someone is
>     >     >     >     > > > >> > using the memory, it should still hold the memory with a shared pointer.
>     >     >     >     > > > >> > But I do see the memory errors increase after this PR was merged.
>     >     >     >     > > > >> >
>     >     >     >     > > > >> > Best,
>     >     >     >     > > > >> > Da
>     >     >     >     > > > >> >
>     >     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     I couldn't reproduce locally with:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh \
>     >     >     >     > > > >> >         build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu \
>     >     >     >     > > > >> >         /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     > Hi
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > Seems master is not running anymore, there's a segmentation fault using
>     >     >     >     > > > >> >     > MKLDNN-CPU
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > I see my PRs failing with a similar error.
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > Pedro
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >
>     >     >     >     > > > >>
>     >     >     >     > > >
>     >     >     >     > >
>     >     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
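
On the missing stack traces mentioned in the quoted messages: a minimal sketch, assuming the tests are started from a Python entry point we control. Python's standard faulthandler module installs handlers for SIGSEGV and similar signals and dumps the Python-level stack of every thread when the process crashes. The helper name enable_crash_tracebacks is hypothetical; it is not part of MXNet or its CI scripts.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump the Python stack of all threads on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL.

    Hypothetical helper for illustration only. `stream` must be a real file
    object with a file descriptor; it defaults to sys.stderr.
    """
    if not faulthandler.is_enabled():
        target = stream if stream is not None else sys.stderr
        faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling enable_crash_tracebacks() before importing mxnet would be one way to use it. The same handlers can be switched on without code changes by exporting PYTHONFAULTHANDLER=1 or running python -X faulthandler. This only shows Python frames; the native frames still need a core dump and gdb.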

Re: segmentation fault in master using mkdlnn

Posted by "Zheng, Da" <dz...@amazon.com>.

Hello Pedro,

I did exactly what you said in your previous email.

I edited ci/docker/runtime_functions.sh based on your patch. Here is my shell history from running your commands:
 2004  vim ci/docker/runtime_functions.sh 
 2005  ci/docker/runtime_functions.sh clean_repo
 2006  ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu

Best,
Da

On 5/4/18, 4:32 AM, "Pedro Larroy" <pe...@gmail.com> wrote:

    Hi Da. I ran it both on my Ubuntu 16.04 workstation and on a p3 instance with
    DLAMI. I'm pretty confident it runs in most Linux environments.
    
    Can you post the exact commands that you ran? It's not clear to me what the
    problem is from your paste. Please make sure your repo is clean and all your
    subrepos are clean before starting the docker build.
    
    ci/docker/runtime_functions.sh clean_repo
    
    Pedro.
    
    On Thu, May 3, 2018 at 7:17 PM, Zheng, Da <dz...@amazon.com> wrote:
    
    > Hello Pedro,
    >
    > I tried your instructions. It seems I can't run the docker in EC2
    > instances.
    > Where did you reproduce the error?
    >
    > Thanks,
    > Da
    >
    > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
    > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
    > gpg: directory `/root/.gnupg' created
    > gpg: new configuration file `/root/.gnupg/gpg.conf' created
    > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
    > this run
    > gpg: keyring `/root/.gnupg/secring.gpg' created
    > gpg: keyring `/root/.gnupg/pubring.gpg' created
    > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
    > gpg: keyserver timed out
    > gpg: keyserver receive failed: keyserver error
    > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
    > Traceback (most recent call last):
    >   File "ci/build.py", line 263, in <module>
    >     sys.exit(main())
    >   File "ci/build.py", line 197, in main
    >     build_docker(platform, docker_binary)
    >   File "ci/build.py", line 73, in build_docker
    >     check_call(cmd)
    >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    >     raise CalledProcessError(retcode, cmd)
    > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
    > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
    > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
    >
    >
    > On 5/3/18, 8:01 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
    >
    >     Hi Da
    >
    >     Reproduction instructions:
    >
    >     On the host:
    >
    >     Adjust core pattern:
    >
    >     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
    >
    >
    >     Use the following patch:
    >
    >     ===============
    >
    >     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
    >     --- a/3rdparty/mkldnn
    >     +++ b/3rdparty/mkldnn
    >     @@ -1 +1 @@
    >     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
    >     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
    >     diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
    >     index 027e287..62649c9 100755
    >     --- a/ci/docker/runtime_functions.sh
    >     +++ b/ci/docker/runtime_functions.sh
    >     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
    >          # https://github.com/apache/incubator-mxnet/issues/10026
    >          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
    >          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    >     -    nosetests-2.7 --verbose tests/python/unittest
    >     -    nosetests-2.7 --verbose tests/python/train
    >     -    nosetests-2.7 --verbose tests/python/quantization
    >     +    export MXNET_TEST_SEED=11
    >     +    export MXNET_MODULE_SEED=812478194
    >     +    pwd
    >     +    export MXNET_TEST_COUNT=10000
    >     +    ulimit -c unlimited
    >     +    ulimit -c
    >     +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
    >     +    #nosetests-2.7 --verbose tests/python/train
    >     +    #nosetests-2.7 --verbose tests/python/quantization
    >      }
    >
    >      unittest_ubuntu_python3_cpu() {
    >
    >
    >
    >     ==============
    >
    >     Build and execute the test, make sure the repo is clean
    >
    >     $ ci/docker/runtime_functions.sh clean_repo
    >
    >     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh \
    >         build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu \
    >         /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    >
    >
    >     Once it crashes it will stop.
    >
    >     Then go in the container:
    >
    >
    >     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
    >
    >     A core should be there.
    >
    >     You might need to install gdb as root by executing the previous command
    >     without the uid so you can use apt-get.
    >
    >     Good luck.
    >
    >     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
    >
    >     > Thanks a lot for locating the error.
    >     > Could you tell me how you reproduced the error?
    >     >
    >     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
    >     >
    >     >     Looks like a problem in mkl's same_shape:
    >     >
    >     >     the mkldnn::memory::desc &desc reference looks invalid (it refers
    >     >     to address 0x10).
    >     >
    >     >     (More stack frames follow...)
    >     >     (gdb) p desc
    >     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
    >     >     (gdb) p dtype
    >     >     $2 = 0
    >     >     (gdb) p shape
    >     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
    >     >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
    >     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
    >     >     fields>}
    >     >     (gdb)
    >     >
    >     >
    >     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com> wrote:
    >     >
    >     >     > There are a few problems with valgrind, which make it less than
    >     >     > ideal for mxnet with the Python interface.
    >     >     >
    >     >     > First, valgrind generates a huge number of irrelevant messages,
    >     >     > most of them from Python itself.
    >     >     >
    >     >     > Second, valgrind can't emulate all CPU instructions. I remember
    >     >     > that when I ran valgrind with mxnet, valgrind exited with a
    >     >     > strange error. I later found that it was caused by an
    >     >     > unsupported CPU instruction.
    >     >     >
    >     >     > Third, valgrind doesn't support multithreading well. As far as I
    >     >     > know, valgrind runs everything in a single thread even if the
    >     >     > program uses multi-threading. An error like this, which is likely
    >     >     > caused by a race condition, can't be caught by valgrind.
    >     >     >
    >     >     > I used to use Address Sanitizer for memory errors. This tool is
    >     >     > much faster and can work with multiple threads. However, it
    >     >     > doesn't work with Python for some reason.
    >     >     >
    >     >     > One thing we could potentially do is use a memory checker for the
    >     >     > C++ unit tests. I'm not sure it will cover all the memory errors
    >     >     > we want to catch.
    >     >     >
    >     >     > Best,
    >     >     > Da
    >     >     >
    >
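
For anyone repeating the core-dump steps quoted in this thread: a minimal sketch, assuming a Linux host, of a pre-flight check for the two settings that came up, the core size limit (ulimit -c) and /proc/sys/kernel/core_pattern. The function names are hypothetical and the check is only illustrative; it reads the settings and does not change them.

```python
import resource
from pathlib import Path

CORE_PATTERN_PATH = Path("/proc/sys/kernel/core_pattern")


def pattern_pipes_to_helper(pattern):
    """A core_pattern starting with '|' sends the dump to a helper program."""
    return pattern.startswith("|")


def core_limit_allows_dumps(soft_limit):
    """True if the soft RLIMIT_CORE value permits writing a core file."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit > 0


def check_core_dump_setup():
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    if not core_limit_allows_dumps(soft):
        problems.append("core size limit is 0; run 'ulimit -c unlimited' first")
    if CORE_PATTERN_PATH.exists():
        pattern = CORE_PATTERN_PATH.read_text().strip()
        if pattern_pipes_to_helper(pattern):
            # The helper (e.g. apport) may not exist inside a container.
            problems.append("core_pattern pipes to a helper: " + pattern)
    return problems
```

Running check_core_dump_setup() inside the container before starting the test loop would show whether a crash can leave a core file at all. Changing core_pattern itself still has to happen on the host, as root.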

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Da. I ran it both on my Ubuntu 16.04 workstation and on a p3 instance with
DLAMI. I'm pretty confident it runs in most Linux environments.

Can you post the exact commands that you ran? It's not clear to me what the
problem is from your paste. Please make sure your repo is clean and all your
subrepos are clean before starting the docker build.

ci/docker/runtime_functions.sh clean_repo

Pedro.
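
The reproduction patch in this thread reruns a single test in a shell loop until it fails. A minimal sketch of the same idea in Python, assuming a POSIX system, which also reports whether the process was killed by a signal such as SIGSEGV. The names run_until_failure and describe_exit are hypothetical; the seeds in the commented example are the ones from the patch.

```python
import os
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_until_failure(cmd, env_overrides=None, max_rounds=10000):
    """Rerun `cmd` until it fails; return (round_number, description) or None."""
    env = dict(os.environ, **(env_overrides or {}))
    for round_number in range(1, max_rounds + 1):
        returncode = subprocess.run(cmd, env=env).returncode
        if returncode != 0:
            return round_number, describe_exit(returncode)
    return None


# Example call (not executed here):
# run_until_failure(
#     ["nosetests-2.7", "--verbose",
#      "tests/python/unittest/test_module.py:test_forward_reshape"],
#     env_overrides={"MXNET_TEST_SEED": "11", "MXNET_MODULE_SEED": "812478194"})
```

Compared with the shell loop, this stops on the first failing round and says whether it was an ordinary test failure or a crash, which helps when the failure is non-deterministic.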


Re: segmentation fault in master using mkdlnn

Posted by Marco de Abreu <ma...@googlemail.com>.

Da, it seems like you have a problem with your internet connection, leading
to a timeout to the keyserver.

-Marco
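
A transient keyserver timeout like the one in the quoted log is usually handled by retrying. A minimal sketch, assuming the failing step can be wrapped in a retry with exponential backoff; the function name run_with_retries and its defaults are hypothetical, and the real fetch happens in a shell script in the Docker build, so this only illustrates the pattern.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, base_delay=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `cmd` until it exits with status 0, doubling the wait after each failure.

    Returns the 1-based attempt that succeeded; raises RuntimeError otherwise.
    `run` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        if run(cmd).returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("command failed after %d attempts: %r" % (attempts, cmd))


# Example call (not executed here), using the command from the quoted log:
# run_with_retries(["gpg", "--keyserver", "keyserver.ubuntu.com",
#                   "--recv-key", "E084DAB9"])
```

With the defaults, the waits between attempts are 2, 4, 8 and 16 seconds before giving up.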

On Thu, May 3, 2018 at 8:20 PM, Anirudh <an...@gmail.com> wrote:

> Hi Pedro and Da,
>
> I am not sure how to install mkldnn with cmake, but with make you can
> reproduce it as follows:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
>     USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=10000
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> I was able to reproduce it on master; now trying on the 1.2 branch.
>
> Anirudh
>
>
> On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dz...@amazon.com> wrote:
>
> > Hello Pedro,
> >
> > I tried your instructions. It seems I can't run the docker in EC2
> > instances.
> > Where did you reproduce the error?
> >
> > Thanks,
> > Da
> >
> > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> > gpg: directory `/root/.gnupg' created
> > gpg: new configuration file `/root/.gnupg/gpg.conf' created
> > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active
> during
> > this run
> > gpg: keyring `/root/.gnupg/secring.gpg' created
> > gpg: keyring `/root/.gnupg/pubring.gpg' created
> > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> > gpg: keyserver timed out
> > gpg: keyserver receive failed: keyserver error
> > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> > Traceback (most recent call last):
> >   File "ci/build.py", line 263, in <module>
> >     sys.exit(main())
> >   File "ci/build.py", line 197, in main
> >     build_docker(platform, docker_binary)
> >   File "ci/build.py", line 73, in build_docker
> >     check_call(cmd)
> >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> >     raise CalledProcessError(retcode, cmd)
> > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status
> 2
> >
> >
> > On 5/3/18, 8:01 AM, "Pedro Larroy" <pe...@gmail.com>
> wrote:
> >
> >     Hi Da
> >
> >     Reproduction instructions:
> >
> >     On the host:
> >
> >     Adjust core pattern:
> >
> >     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
> >
> >
> >     Use the following patch:
> >
> >     ===============
> >
> >     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> >     --- a/3rdparty/mkldnn
> >     +++ b/3rdparty/mkldnn
> >     @@ -1 +1 @@
> >     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> >     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> >     diff --git a/ci/docker/runtime_functions.sh
> > b/ci/docker/runtime_functions.sh
> >     index 027e287..62649c9 100755
> >     --- a/ci/docker/runtime_functions.sh
> >     +++ b/ci/docker/runtime_functions.sh
> >     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
> >          # https://github.com/apache/incubator-mxnet/issues/10026
> >          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
> >          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> >     -    nosetests-2.7 --verbose tests/python/unittest
> >     -    nosetests-2.7 --verbose tests/python/train
> >     -    nosetests-2.7 --verbose tests/python/quantization
> >     +    export MXNET_TEST_SEED=11
> >     +    export MXNET_MODULE_SEED=812478194
> >     +    pwd
> >     +    export MXNET_TEST_COUNT=10000
> >     +    ulimit -c unlimited
> >     +    ulimit -c
> >     +    while nosetests-2.7 --verbose
> >     tests/python/unittest/test_module.py:test_forward_reshape; do echo
> > round;
> >     done
> >     +    #nosetests-2.7 --verbose tests/python/train
> >     +    #nosetests-2.7 --verbose tests/python/quantization
> >      }
> >
> >      unittest_ubuntu_python3_cpu() {
> >
> >
> >
> >     ==============
> >
> >     Build and execute the test, make sure the repo is clean
> >
> >     $ ci/docker/runtime_functions.sh clean_repo
> >
> >     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> >
> >
> >     Once it crashes it will stop.
> >
> >     Then go in the container:
> >
> >
> >     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
> >
> >     A core should be there.
> >
> >     you might need to install gdb as root by executing the previous
> command
> >     without uid so you can use apt-get.
> >
> >
> >
> >
> >     Good luck.
> >
> >
> >
> >
> >
> >
> >
> >     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
> >
> >     > Thanks a lot for locating the error.
> >     > Could you tell me How you reproduce the error?
> >     >
> >     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com>
> > wrote:
> >     >
> >     >     Looks like a problem in mkl's same_shape
> >     >
> >     >     the pointer to mkldnn::memory::desc &desc  looks invalid.
> >     >
> >     >     (More stack frames follow...)
> >     >     (gdb) p desc
> >     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading
> > variable>
> >     >     (gdb) p dtype
> >     >     $2 = 0
> >     >     (gdb) p shape
> >     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50:
> > {<nnvm::Tuple<long>> =
> >     >     {static kStackCache = <optimized out>, ndim_ = 2,
> > num_heap_allocated_
> >     > = 0,
> >     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ =
> > 0x0}, <No
> >     > data
> >     >     fields>}
> >     >     (gdb)
> >     >
> >     >
> >     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com>
> > wrote:
> >     >
> >     >     > There are a few problems with valgrind, which makes it not an
> > ideal
> >     > tool
> >     >     > for mxnet with python interface.
> >     >     >
> >     >     > First, valgrind generates a huge number of irrelevant
> > messages, most
> >     > of
> >     >     > them from in Python itself.
> >     >     >
> >     >     > Second, valgrind can't emulate all CPU instructions. I
> > remember that
> >     > when
> >     >     > I run valgrind with mxnet, valgrind exits with a strange
> > error. I
> >     > later on
> >     >     > found that it was caused by an unsupported CPU instructions.
> >     >     >
> >     >     > Third, valgrind doesn't support multithreading well. As far
> as
> > I
> >     > know,
> >     >     > valgrind runs everything in a single thread even if the
> > program uses
> >     >     > multi-threading. An error like this, which is likely caused
> by
> > race
> >     >     > condition, can't be caught by valgrind.
> >     >     >
> >     >     > I used to use Address Sanitizer for memory errors. This tool
> > is much
> >     >     > faster and can work with multi-threads. However, it doesn't
> > work with
> >     >     > Python for some reason.
> >     >     >
> >     >     > One thing we potentially can do is to use memory checker for
> > C++ unit
> >     >     > tests. Not sure it'll cover all memory errors we want.
> >     >     >
> >     >     > Best,
> >     >     > Da
> >     >     >
> >     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <
> > pedro.larroy.lists@gmail.com>
> >     > wrote:
> >     >     >
> >     >     >     It's very difficult to reproduce, non-deterministic. We
> > were also
> >     >     > running
> >     >     >     without signal handlers in CI so there are no stack
> traces
> >     >     > unfortunately.
> >     >     >
> >     >     >     Care to elaborate why valgrind doesn't work with Python?
> >     >     >
> >     >     >
> >     >     >
> >     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <
> > zhengda1936@gmail.com>
> >     >     > wrote:
> >     >     >
> >     >     >     > can we build it in CI? segfault doesn't happen infrequently.
> >     >     >     >
> >     >     >     > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
> >     >     >     >
> >     >     >     > > you can try Intel Inspector, which is like an
> enhanced
> >     > version of
> >     >     >     > valgrind
> >     >     >     > > with a GUI and whatnot.
> >     >     >     > >
> >     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <
> >     > zhengda1936@gmail.com>
> >     >     > wrote:
> >     >     >     > >
> >     >     >     > > > valgrind doesn't work with Python. also, valgrind
> > doesn't
> >     >     > support some
> >     >     >     > > > CPU instructions used by MXNet (I think some
> > instructions
> >     >     > related to
> >     >     >     > > > random generator).
> >     >     >     > > >
> >     >     >     > > >
> >     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <
> >     >     > bhavinthaker@gmail.com>
> >     >     >     > > > wrote:
> >     >     >     > > > > Have you tried running with valgrind to get some
> > clues
> >     > on the
> >     >     >     > > root-cause?
> >     >     >     > > > >
> >     >     >     > > > > Bhavin Thaker.
> >     >     >     > > > >
> >     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <
> >     > zhengda1936@gmail.com
> >     >     > >
> >     >     >     > wrote:
> >     >     >     > > > >
> >     >     >     > > > >> It might also be possible that this isn't an
> > MKLDNN bug.
> >     >     >     > > > >> I just saw a similar memory error without MKLDNN
> > build.
> >     >     >     > > > >>
> >     >     >     > > > >>
> >     >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
> >     >     > organizations/jenkins/
> >     >     >     > > incubator-mxnet/detail/PR-10783/1/pipeline
> >     >     >     > > > >>
> >     >     >     > > > >> Best,
> >     >     >     > > > >> Da
> >     >     >     > > > >>
> >     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <
> >     > dzzhen@amazon.com>
> >     >     >     > wrote:
> >     >     >     > > > >> > There might be a race condition that causes
> the
> > memory
> >     >     > error.
> >     >     >     > > > >> > It might be caused by this PR:
> >     >     >     > > > >> > https://github.com/apache/
> > incubator-mxnet/pull/10706/
> >     > files
> >     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
> >     >     >     > > > >> > However, I don't know why this causes memory
> > error. If
> >     >     > someone is
> >     >     >     > > > using
> >     >     >     > > > >> the memory, it should still hold the memory with
> > shared
> >     >     > pointer.
> >     >     >     > > > >> > But I do see the memory error increase after
> > this PR
> >     > is
> >     >     > merged.
> >     >     >     > > > >> >
> >     >     >     > > > >> > Best,
> >     >     >     > > > >> > Da
> >     >     >     > > > >> >
> >     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
> >     >     >     > pedro.larroy.lists@gmail.com>
> >     >     >     > > > >> wrote:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     I couldn't reproduce locally with:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     ci/build.py -p ubuntu_cpu
> >     > /work/runtime_functions.sh
> >     >     >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py
> > --platform
> >     >     > ubuntu_cpu
> >     >     >     > > > >> >     /work/runtime_functions.sh
> >     > unittest_ubuntu_python2_cpu
> >     >     >     > > > >> >
> >     >     >     > > > >> >
> >     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro
> > Larroy <
> >     >     >     > > > >> pedro.larroy.lists@gmail.com>
> >     >     >     > > > >> >     wrote:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     > Hi
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > Seems master is not running  anymore,
> > there's a
> >     >     > segmentation
> >     >     >     > > > fault
> >     >     >     > > > >> using
> >     >     >     > > > >> >     > MKDLNN-CPU
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     >
> >     >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
> >     >     > organizations/jenkins/
> >     >     >     > > > >> >     > incubator-mxnet/detail/master/
> > 801/pipeline/662
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > I see my PRs failing with a similar
> error.
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > Pedro
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >
> >     >     >     > > > >> >
> >     >     >     > > > >>
> >     >     >     > > >
> >     >     >     > >
> >     >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >
> >     >
> >
> >
> >
>

Re: segmentation fault in master using mkdlnn

Posted by Anirudh <an...@gmail.com>.

Hi Pedro and Da,

I am not sure how to build with MKLDNN using cmake, but with make you can
reproduce it as follows:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
    USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

I was able to reproduce on master, now trying on 1.2 branch.
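
Since the crash is non-deterministic, it can help to rerun the test until it
fails and record the exit status. A minimal sketch, assuming a hypothetical
helper named run_until_failure (not something in the repo); a process killed
by SIGSEGV typically shows up in the shell as exit status 139:

```shell
# run_until_failure MAX_ROUNDS COMMAND [ARGS...]
# Hypothetical helper: rerun COMMAND until it fails or MAX_ROUNDS is reached.
run_until_failure() {
    local max_rounds="$1" round=0 status=0
    shift
    while [ "$round" -lt "$max_rounds" ]; do
        round=$((round + 1))
        "$@" || status=$?          # remember the failing exit status
        if [ "$status" -ne 0 ]; then
            echo "round $round: command failed with exit status $status"
            return "$status"
        fi
    done
    echo "no failure in $max_rounds rounds"
}
```

For example: run_until_failure 100 nosetests-2.7 -v
tests/python/unittest/test_module.py:test_forward_reshape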

Anirudh


On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dz...@amazon.com> wrote:

> Hello Pedro,
>
> I tried your instructions. It seems I can't run the docker in EC2
> instances.
> Where did you reproduce the error?
>
> Thanks,
> Da
>
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> gpg: directory `/root/.gnupg' created
> gpg: new configuration file `/root/.gnupg/gpg.conf' created
> gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
> this run
> gpg: keyring `/root/.gnupg/secring.gpg' created
> gpg: keyring `/root/.gnupg/pubring.gpg' created
> gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> gpg: keyserver timed out
> gpg: keyserver receive failed: keyserver error
> The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> Traceback (most recent call last):
>   File "ci/build.py", line 263, in <module>
>     sys.exit(main())
>   File "ci/build.py", line 197, in main
>     build_docker(platform, docker_binary)
>   File "ci/build.py", line 73, in build_docker
>     check_call(cmd)
>   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>
>
> On 5/3/18, 8:01 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
>
>     Hi Da
>
>     Reproduction instructions:
>
>     On the host:
>
>     Adjust core pattern:
>
>     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
>     Use the following patch:
>
>     ===============
>
>     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
>     --- a/3rdparty/mkldnn
>     +++ b/3rdparty/mkldnn
>     @@ -1 +1 @@
>     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
>     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
>     diff --git a/ci/docker/runtime_functions.sh
> b/ci/docker/runtime_functions.sh
>     index 027e287..62649c9 100755
>     --- a/ci/docker/runtime_functions.sh
>     +++ b/ci/docker/runtime_functions.sh
>     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>          # https://github.com/apache/incubator-mxnet/issues/10026
>          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>     -    nosetests-2.7 --verbose tests/python/unittest
>     -    nosetests-2.7 --verbose tests/python/train
>     -    nosetests-2.7 --verbose tests/python/quantization
>     +    export MXNET_TEST_SEED=11
>     +    export MXNET_MODULE_SEED=812478194
>     +    pwd
>     +    export MXNET_TEST_COUNT=10000
>     +    ulimit -c unlimited
>     +    ulimit -c
>     +    while nosetests-2.7 --verbose
>     tests/python/unittest/test_module.py:test_forward_reshape; do echo
> round;
>     done
>     +    #nosetests-2.7 --verbose tests/python/train
>     +    #nosetests-2.7 --verbose tests/python/quantization
>      }
>
>      unittest_ubuntu_python3_cpu() {
>
>
>
>     ==============
>
>     Build and execute the test, make sure the repo is clean
>
>     $ ci/docker/runtime_functions.sh clean_repo
>
>     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
>     Once it crashes it will stop.
>
>     Then go in the container:
>
>
>     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>
>     A core should be there.
>
>     you might need to install gdb as root by executing the previous command
>     without uid so you can use apt-get.
>
>
>
>
>     Good luck.
>
>
>
>
>
>
>
>     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
>
>     > Thanks a lot for locating the error.
>     > Could you tell me How you reproduce the error?
>     >
>     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com>
> wrote:
>     >
>     >     Looks like a problem in mkl's same_shape
>     >
>     >     the pointer to mkldnn::memory::desc &desc  looks invalid.
>     >
>     >     (More stack frames follow...)
>     >     (gdb) p desc
>     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading
> variable>
>     >     (gdb) p dtype
>     >     $2 = 0
>     >     (gdb) p shape
>     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50:
> {<nnvm::Tuple<long>> =
>     >     {static kStackCache = <optimized out>, ndim_ = 2,
> num_heap_allocated_
>     > = 0,
>     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ =
> 0x0}, <No
>     > data
>     >     fields>}
>     >     (gdb)
>     >
>     >
>     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com>
> wrote:
>     >
>     >     > There are a few problems with valgrind, which makes it not an
> ideal
>     > tool
>     >     > for mxnet with python interface.
>     >     >
>     >     > First, valgrind generates a huge number of irrelevant
> messages, most
>     > of
>     >     > them from in Python itself.
>     >     >
>     >     > Second, valgrind can't emulate all CPU instructions. I
> remember that
>     > when
>     >     > I run valgrind with mxnet, valgrind exits with a strange
> error. I
>     > later on
>     >     > found that it was caused by an unsupported CPU instructions.
>     >     >
>     >     > Third, valgrind doesn't support multithreading well. As far as
> I
>     > know,
>     >     > valgrind runs everything in a single thread even if the
> program uses
>     >     > multi-threading. An error like this, which is likely caused by
> race
>     >     > condition, can't be caught by valgrind.
>     >     >
>     >     > I used to use Address Sanitizer for memory errors. This tool
> is much
>     >     > faster and can work with multi-threads. However, it doesn't
> work with
>     >     > Python for some reason.
>     >     >
>     >     > One thing we potentially can do is to use memory checker for
> C++ unit
>     >     > tests. Not sure it'll cover all memory errors we want.
>     >     >
>     >     > Best,
>     >     > Da
>     >     >
>     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <
> pedro.larroy.lists@gmail.com>
>     > wrote:
>     >     >
>     >     >     It's very difficult to reproduce, non-deterministic. We
> were also
>     >     > running
>     >     >     without signal handlers in CI so there are no stack traces
>     >     > unfortunately.
>     >     >
>     >     >     Care to elaborate why valgrind doesn't work with Python?
>     >     >
>     >     >
>     >     >
>     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <
> zhengda1936@gmail.com>
>     >     > wrote:
>     >     >
>     >     >     > can we build it in CI? segfault doesn't happen infrequently.
>     >     >     >
>     >     >     > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
>     >     >     >
>     >     >     > > you can try Intel Inspector, which is like an enhanced
>     > version of
>     >     >     > valgrind
>     >     >     > > with a GUI and whatnot.
>     >     >     > >
>     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <
>     > zhengda1936@gmail.com>
>     >     > wrote:
>     >     >     > >
>     >     >     > > > valgrind doesn't work with Python. also, valgrind
> doesn't
>     >     > support some
>     >     >     > > > CPU instructions used by MXNet (I think some
> instructions
>     >     > related to
>     >     >     > > > random generator).
>     >     >     > > >
>     >     >     > > >
>     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <
>     >     > bhavinthaker@gmail.com>
>     >     >     > > > wrote:
>     >     >     > > > > Have you tried running with valgrind to get some
> clues
>     > on the
>     >     >     > > root-cause?
>     >     >     > > > >
>     >     >     > > > > Bhavin Thaker.
>     >     >     > > > >
>     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <
>     > zhengda1936@gmail.com
>     >     > >
>     >     >     > wrote:
>     >     >     > > > >
>     >     >     > > > >> It might also be possible that this isn't an
> MKLDNN bug.
>     >     >     > > > >> I just saw a similar memory error without MKLDNN
> build.
>     >     >     > > > >>
>     >     >     > > > >>
>     >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
>     >     > organizations/jenkins/
>     >     >     > > incubator-mxnet/detail/PR-10783/1/pipeline
>     >     >     > > > >>
>     >     >     > > > >> Best,
>     >     >     > > > >> Da
>     >     >     > > > >>
>     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <
>     > dzzhen@amazon.com>
>     >     >     > wrote:
>     >     >     > > > >> > There might be a race condition that causes the
> memory
>     >     > error.
>     >     >     > > > >> > It might be caused by this PR:
>     >     >     > > > >> > https://github.com/apache/
> incubator-mxnet/pull/10706/
>     > files
>     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>     >     >     > > > >> > However, I don't know why this causes memory
> error. If
>     >     > someone is
>     >     >     > > > using
>     >     >     > > > >> the memory, it should still hold the memory with
> shared
>     >     > pointer.
>     >     >     > > > >> > But I do see the memory error increase after
> this PR
>     > is
>     >     > merged.
>     >     >     > > > >> >
>     >     >     > > > >> > Best,
>     >     >     > > > >> > Da
>     >     >     > > > >> >
>     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
>     >     >     > pedro.larroy.lists@gmail.com>
>     >     >     > > > >> wrote:
>     >     >     > > > >> >
>     >     >     > > > >> >     I couldn't reproduce locally with:
>     >     >     > > > >> >
>     >     >     > > > >> >     ci/build.py -p ubuntu_cpu
>     > /work/runtime_functions.sh
>     >     >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py
> --platform
>     >     > ubuntu_cpu
>     >     >     > > > >> >     /work/runtime_functions.sh
>     > unittest_ubuntu_python2_cpu
>     >     >     > > > >> >
>     >     >     > > > >> >
>     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro
> Larroy <
>     >     >     > > > >> pedro.larroy.lists@gmail.com>
>     >     >     > > > >> >     wrote:
>     >     >     > > > >> >
>     >     >     > > > >> >     > Hi
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > Seems master is not running  anymore,
> there's a
>     >     > segmentation
>     >     >     > > > fault
>     >     >     > > > >> using
>     >     >     > > > >> >     > MKDLNN-CPU
>     >     >     > > > >> >     >
>     >     >     > > > >> >     >
>     >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
>     >     > organizations/jenkins/
>     >     >     > > > >> >     > incubator-mxnet/detail/master/
> 801/pipeline/662
>     >     >     > > > >> >     >
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > I see my PRs failing with a similar error.
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > Pedro
>     >     >     > > > >> >     >
>     >     >     > > > >> >
>     >     >     > > > >> >
>     >     >     > > > >>
>     >     >     > > >
>     >     >     > >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>

Re: segmentation fault in master using mkdlnn

Posted by "Zheng, Da" <dz...@amazon.com>.

Hello Pedro,

I tried your instructions, but it seems I can't run the Docker build on EC2 instances.
Where did you reproduce the error?

Thanks,
Da

+ echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
+ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg: directory `/root/.gnupg' created
gpg: new configuration file `/root/.gnupg/gpg.conf' created
gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/root/.gnupg/secring.gpg' created
gpg: keyring `/root/.gnupg/pubring.gpg' created
gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
Traceback (most recent call last):
  File "ci/build.py", line 263, in <module>
    sys.exit(main())
  File "ci/build.py", line 197, in main
    build_docker(platform, docker_binary)
  File "ci/build.py", line 73, in build_docker
    check_call(cmd)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000', '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2


On 5/3/18, 8:01 AM, "Pedro Larroy" <pe...@gmail.com> wrote:

    Hi Da
    
    Reproduction instructions:
    
    On the host:
    
    Adjust core pattern:
    
    $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
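
Before changing the pattern, it may be worth checking what the host currently
uses. As noted elsewhere in this thread, the Ubuntu default pipes the core to
/usr/share/apport/apport, which doesn't exist inside the container. A minimal
sketch, assuming a hypothetical helper named core_pattern_kind that only
inspects the first character of the pattern:

```shell
# core_pattern_kind FILE
# Hypothetical helper: prints "pipe" if the pattern hands the core to a
# helper program (leading '|'), otherwise "file".
core_pattern_kind() {
    local pattern
    pattern=$(cat "$1") || return 2
    case "$pattern" in
        '|'*) echo "pipe" ;;
        *)    echo "file" ;;
    esac
}
```

Running core_pattern_kind /proc/sys/kernel/core_pattern on the host should
print "file" once the pattern above has been written.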
    
    
    Use the following patch:
    
    ===============
    
    diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
    --- a/3rdparty/mkldnn
    +++ b/3rdparty/mkldnn
    @@ -1 +1 @@
    -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
    +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
    diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
    index 027e287..62649c9 100755
    --- a/ci/docker/runtime_functions.sh
    +++ b/ci/docker/runtime_functions.sh
    @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
         # https://github.com/apache/incubator-mxnet/issues/10026
         #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
         export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    -    nosetests-2.7 --verbose tests/python/unittest
    -    nosetests-2.7 --verbose tests/python/train
    -    nosetests-2.7 --verbose tests/python/quantization
    +    export MXNET_TEST_SEED=11
    +    export MXNET_MODULE_SEED=812478194
    +    pwd
    +    export MXNET_TEST_COUNT=10000
    +    ulimit -c unlimited
    +    ulimit -c
    +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
    +    #nosetests-2.7 --verbose tests/python/train
    +    #nosetests-2.7 --verbose tests/python/quantization
     }
    
     unittest_ubuntu_python3_cpu() {
    
    
    
    ==============
    
    Make sure the repo is clean, then build and execute the test:
    
    $ ci/docker/runtime_functions.sh clean_repo
    
    $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh \
        build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu \
        /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    
    
    Once it crashes it will stop.
    
    Then go in the container:
    
    
    $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
    
    A core should be there.
    
    You might need to install gdb as root: rerun the previous command
    without the uid so that you can use apt-get.
    
    
    
    
    Good luck.
    
    
    
    
    
    
    
    On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dz...@amazon.com> wrote:
    
    > Thanks a lot for locating the error.
    > Could you tell me how you reproduced the error?
    >
    > On 5/3/18, 7:45 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
    >
    >     Looks like a problem in mkl's same_shape
    >
    >     the reference mkldnn::memory::desc &desc looks invalid: it is bound to address 0x10.
    >
    >     (More stack frames follow...)
    >     (gdb) p desc
    >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
    >     (gdb) p dtype
    >     $2 = 0
    >     (gdb) p shape
    >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
    >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
    >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
    >     fields>}
    >     (gdb)
    >
    >
    >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dz...@amazon.com> wrote:
    >
    >     > There are a few problems with valgrind that make it a poor tool
    >     > for MXNet with the Python interface.
    >     >
    >     > First, valgrind generates a huge number of irrelevant messages,
    >     > most of them from Python itself.
    >     >
    >     > Second, valgrind can't emulate all CPU instructions. I remember
    >     > that when I ran valgrind with MXNet, it exited with a strange
    >     > error. I later found that it was caused by an unsupported CPU
    >     > instruction.
    >     >
    >     > Third, valgrind doesn't support multithreading well. As far as I
    >     > know, valgrind runs everything in a single thread even if the
    >     > program uses multi-threading. An error like this, which is likely
    >     > caused by a race condition, can't be caught by valgrind.
    >     >
    >     > I used to use Address Sanitizer for memory errors. This tool is much
    >     > faster and works with multiple threads. However, it doesn't work with
    >     > Python for some reason.
    >     >
    >     > One thing we could potentially do is use a memory checker for the C++
    >     > unit tests. I'm not sure it'll cover all the memory errors we want.
    >     >
    >     > Best,
    >     > Da
    >     >
    >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pe...@gmail.com> wrote:
    >     >
    >     >     It's very difficult to reproduce because it is non-deterministic.
    >     >     We were also running without signal handlers in CI, so
    >     >     unfortunately there are no stack traces.
    >     >
    >     >     Care to elaborate why valgrind doesn't work with Python?
    >     >
    >     >
    >     >
    >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zh...@gmail.com> wrote:
    >     >
    >     >     > can we build it in CI? segfault doesn't happen infrequently.
    >     >     >
    >     >     > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cj...@gmail.com> wrote:
    >     >     >
    >     >     > > you can try Intel Inspector, which is like an enhanced version
    >     >     > > of valgrind with a GUI and whatnot.
    >     >     > >
    >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <
    > zhengda1936@gmail.com>
    >     > wrote:
    >     >     > >
    >     >     > > > valgrind doesn't work with Python. also, valgrind doesn't
    >     > support some
    >     >     > > > CPU instructions used by MXNet (I think some instructions
    >     > related to
    >     >     > > > random generator).
    >     >     > > >
    >     >     > > >
    >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <
    >     > bhavinthaker@gmail.com>
    >     >     > > > wrote:
    >     >     > > > > Have you tried running with valgrind to get some clues
    > on the
    >     >     > > root-cause?
    >     >     > > > >
    >     >     > > > > Bhavin Thaker.
    >     >     > > > >
    >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <
    > zhengda1936@gmail.com
    >     > >
    >     >     > wrote:
    >     >     > > > >
    >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
    >     >     > > > >> I just saw a similar memory error without MKLDNN build.
    >     >     > > > >>
    >     >     > > > >>
    >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
    >     > organizations/jenkins/
    >     >     > > incubator-mxnet/detail/PR-10783/1/pipeline
    >     >     > > > >>
    >     >     > > > >> Best,
    >     >     > > > >> Da
    >     >     > > > >>
    >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <
    > dzzhen@amazon.com>
    >     >     > wrote:
    >     >     > > > >> > There might be a race condition that causes the memory
    >     > error.
    >     >     > > > >> > It might be caused by this PR:
    >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/
    > files
    >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
    >     >     > > > >> > However, I don't know why this causes memory error. If
    >     > someone is
    >     >     > > > using
    >     >     > > > >> the memory, it should still hold the memory with shared
    >     > pointer.
    >     >     > > > >> > But I do see the memory error increase after this PR
    > is
    >     > merged.
    >     >     > > > >> >
    >     >     > > > >> > Best,
    >     >     > > > >> > Da
    >     >     > > > >> >
    >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
    >     >     > pedro.larroy.lists@gmail.com>
    >     >     > > > >> wrote:
    >     >     > > > >> >
    >     >     > > > >> >     I couldn't reproduce locally with:
    >     >     > > > >> >
    >     >     > > > >> >     ci/build.py -p ubuntu_cpu
    > /work/runtime_functions.sh
    >     >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform
    >     > ubuntu_cpu
    >     >     > > > >> >     /work/runtime_functions.sh
    > unittest_ubuntu_python2_cpu
    >     >     > > > >> >
    >     >     > > > >> >
    >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <
    >     >     > > > >> pedro.larroy.lists@gmail.com>
    >     >     > > > >> >     wrote:
    >     >     > > > >> >
    >     >     > > > >> >     > Hi
    >     >     > > > >> >     >
    >     >     > > > >> >     > Seems master is not running  anymore, there's a
    >     > segmentation
    >     >     > > > fault
    >     >     > > > >> using
    >     >     > > > >> >     > MKDLNN-CPU
    >     >     > > > >> >     >
    >     >     > > > >> >     >
    >     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
    >     > organizations/jenkins/
    >     >     > > > >> >     > incubator-mxnet/detail/master/801/pipeline/662
    >     >     > > > >> >     >
    >     >     > > > >> >     >
    >     >     > > > >> >     > I see my PRs failing with a similar error.
    >     >     > > > >> >     >
    >     >     > > > >> >     > Pedro
    >     >     > > > >> >     >
    >     >     > > > >> >
    >     >     > > > >> >
    >     >     > > > >>
    >     >     > > >
    >     >     > >
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Da

Reproduction instructions:

On the host:

Adjust core pattern:

$ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
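A small sketch for checking what the current core pattern does before changing it. The helper names are made up for illustration. The assumption (from the kernel's core(5) documentation) is that a pattern starting with '|' pipes the dump to a program instead of writing a file, and that the setting is shared between the host and its containers, which is why it is changed on the host here.

```python
from pathlib import Path

CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def parse_core_pattern(text):
    """Classify a core_pattern value as a pipe handler or a file template."""
    value = text.strip()
    if value.startswith("|"):
        # A leading '|' means the kernel pipes the dump to this program.
        parts = value[1:].split()
        return {"kind": "pipe", "target": parts[0] if parts else ""}
    return {"kind": "file", "target": value}


def diagnose(text):
    """Return a one-line hint about where core dumps will end up."""
    info = parse_core_pattern(text)
    if info["kind"] == "pipe":
        return "cores are piped to %s, no core file is written directly" % info["target"]
    return "cores are written using the template %s" % info["target"]


if __name__ == "__main__":
    if CORE_PATTERN.exists():
        print(diagnose(CORE_PATTERN.read_text()))
```

If this reports a pipe handler, writing a plain file template as in the echo command above is one way to get core files on disk.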


Use the following patch:

===============

diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
+Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 027e287..62649c9 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
     # https://github.com/apache/incubator-mxnet/issues/10026
     #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
     export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
-    nosetests-2.7 --verbose tests/python/unittest
-    nosetests-2.7 --verbose tests/python/train
-    nosetests-2.7 --verbose tests/python/quantization
+    export MXNET_TEST_SEED=11
+    export MXNET_MODULE_SEED=812478194
+    pwd
+    export MXNET_TEST_COUNT=10000
+    ulimit -c unlimited
+    ulimit -c
+    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
+    #nosetests-2.7 --verbose tests/python/train
+    #nosetests-2.7 --verbose tests/python/quantization
 }

 unittest_ubuntu_python3_cpu() {



==============

Build and execute the test. Make sure the repo is clean first:

$ ci/docker/runtime_functions.sh clean_repo

$ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
  ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu


The loop stops as soon as the test crashes.

Then go into the container:


$ ci/build.py -p ubuntu_cpu --into-container --print-docker-run

A core file should be there.

You might need to install gdb as root; to do that, execute the previous
command without the uid so you can use apt-get.
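The same "loop until it crashes" idea can be sketched in Python, in case the shell loop is inconvenient. This is only an illustration: the function names are hypothetical, the test id is the one from the patch, and raising the core limit mirrors `ulimit -c unlimited` only up to the hard limit of the calling process.

```python
import resource
import signal
import subprocess
import sys


def raise_core_limit():
    """Rough equivalent of `ulimit -c unlimited`, capped at the hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def describe_exit(returncode):
    """Translate a subprocess return code into a readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return "killed by %s" % signal.Signals(-returncode).name
    return "exit status %d" % returncode


def run_until_failure(cmd, max_rounds=10000):
    """Run cmd repeatedly; return (round, returncode) of the first failure, else None."""
    for round_no in range(1, max_rounds + 1):
        proc = subprocess.run(cmd, preexec_fn=raise_core_limit)
        if proc.returncode != 0:
            return round_no, proc.returncode
        print("round", round_no)
    return None


if __name__ == "__main__":
    result = run_until_failure(
        ["nosetests-2.7", "--verbose",
         "tests/python/unittest/test_module.py:test_forward_reshape"])
    if result:
        print("failed in round %d: %s" % (result[0], describe_exit(result[1])))
```

A negative return code is the quickest hint that the run died from a signal such as SIGSEGV rather than from an ordinary test failure.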




Good luck.








Re: segmentation fault in master using mkdlnn

Posted by "Zheng, Da" <dz...@amazon.com>.

Thanks a lot for locating the error.
Could you tell me how you reproduced the error?


Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

Looks like the problem is in the MKLDNN same_shape code:

the reference mkldnn::memory::desc &desc looks invalid.

(More stack frames follow...)
(gdb) p desc
$1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
(gdb) p dtype
$2 = 0
(gdb) p shape
$3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
{static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
    data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
fields>}
(gdb)
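A quick way to read the numbers in the gdb output, as a sketch only. The values are copied from the session above. The interpretation is an assumption: with ndim_ = 2 only the first two slots of data_stack_ should be meaningful, and a reference at address 0x10 is consistent with a member read at a small offset from a null pointer, though the dump alone doesn't prove that.

```python
# Values copied from the gdb session above.
desc_address = 0x10
shape_ndim = 2
data_stack = [20, 1, 139878025134112, 28]

# Only the first ndim_ entries of the inline buffer describe the shape.
shape = tuple(data_stack[:shape_ndim])

# The remaining slots are unused; printing them as hex shows what they hold.
leftover = [hex(v) for v in data_stack[shape_ndim:]]

# An address inside the first page is never a valid object on Linux.
desc_in_null_page = desc_address < 4096

print("shape:", shape)
print("unused slots:", leftover)
print("desc address inside the first page:", desc_in_null_page)
```

So the shape itself (20, 1) looks sane; it is the desc argument that is broken.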



Re: segmentation fault in master using mkdlnn

Posted by "Zheng, Da" <dz...@amazon.com>.

There are a few problems with valgrind that make it a poor fit for MXNet with the Python interface.

First, valgrind generates a huge number of irrelevant messages, most of them from Python itself.

Second, valgrind can't emulate all CPU instructions. I remember that when I ran valgrind with MXNet, valgrind exited with a strange error. I later found that it was caused by an unsupported CPU instruction.

Third, valgrind doesn't support multithreading well. As far as I know, valgrind runs everything in a single thread even if the program is multi-threaded. An error like this one, which is likely caused by a race condition, is unlikely to be caught by valgrind.

I used to use Address Sanitizer for memory errors. This tool is much faster and works with multiple threads. However, it doesn't work with Python for some reason.

One thing we could potentially do is use a memory checker for the C++ unit tests. I'm not sure it would cover all the memory errors we want to catch.

Best,
Da
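Independent of valgrind or Address Sanitizer, a Python-level traceback at the moment of a crash can be obtained with the standard-library faulthandler module. The sketch below assumes a Python 3 interpreter (the CI job in this thread runs nosetests-2.7, so this would need a Python 3 run or the backport), and it uses a self-sent SIGSEGV as a stand-in for a crashing native call. The helper name is made up for illustration.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(code):
    """Run a Python snippet in a child interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    # Stand-in for a crashing native call: send SIGSEGV to ourselves.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    proc = run_with_faulthandler(crash)
    print("return code:", proc.returncode)
    print(proc.stderr)
```

This only shows which Python line was executing when the signal arrived; it does not replace a native backtrace or a core dump.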

On 5/3/18, 6:50 AM, "Pedro Larroy" <pe...@gmail.com> wrote:

    It's very difficult to reproduce, non-deterministic. We were also running
    without signal handlers in CI so there are no stack traces unfortunately.
    
    Care to elaborate why valgrind doesn't work with Python?
    
    
    
    On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zh...@gmail.com> wrote:
    
    > can we build it in CI? segfault doesn't happen infrequently.
    >
    > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cj...@gmail.com> wrote:
    >
    > > you can try Intel Inspector, which is like an enhanced version of
    > valgrind
    > > with a GUI and whatnot.
    > >
    > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zh...@gmail.com> wrote:
    > >
    > > > valgrind doesn't work with Python. also, valgrind doesn't support some
    > > > CPU instructions used by MXNet (I think some instructions related to
    > > > random generator).
    > > >
    > > >
    > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bh...@gmail.com>
    > > > wrote:
    > > > > Have you tried running with valgrind to get some clues on the
    > > root-cause?
    > > > >
    > > > > Bhavin Thaker.
    > > > >
    > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zh...@gmail.com>
    > wrote:
    > > > >
    > > > >> It might also be possible that this isn't an MKLDNN bug.
    > > > >> I just saw a similar memory error without MKLDNN build.
    > > > >>
    > > > >>
    > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
    > > incubator-mxnet/detail/PR-10783/1/pipeline
    > > > >>
    > > > >> Best,
    > > > >> Da
    > > > >>
    > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dz...@amazon.com>
    > wrote:
    > > > >> > There might be a race condition that causes the memory error.
    > > > >> > It might be caused by this PR:
    > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
    > > > >> > This PR removes MKLDNN memory from NDArray.
    > > > >> > However, I don't know why this causes memory error. If someone is
    > > > using
    > > > >> the memory, it should still hold the memory with shared pointer.
    > > > >> > But I do see the memory error increase after this PR is merged.
    > > > >> >
    > > > >> > Best,
    > > > >> > Da
    > > > >> >
    > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
    > pedro.larroy.lists@gmail.com>
    > > > >> wrote:
    > > > >> >
    > > > >> >     I couldn't reproduce locally with:
    > > > >> >
    > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
    > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
    > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    > > > >> >
    > > > >> >
    > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <
    > > > >> pedro.larroy.lists@gmail.com>
    > > > >> >     wrote:
    > > > >> >
    > > > >> >     > Hi
    > > > >> >     >
    > > > >> >     > Seems master is not running  anymore, there's a segmentation
    > > > fault
    > > > >> using
    > > > >> >     > MKDLNN-CPU
    > > > >> >     >
    > > > >> >     >
    > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
    > > > >> >     > incubator-mxnet/detail/master/801/pipeline/662
    > > > >> >     >
    > > > >> >     >
    > > > >> >     > I see my PRs failing with a similar error.
    > > > >> >     >
    > > > >> >     > Pedro
    > > > >> >     >
    > > > >> >
    > > > >> >
    > > > >>
    > > >
    > >
    >

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

Hi

Managed to get a stack trace:

+ nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape
[WARNING] *** module-level seed is set: all tests running deterministically ***
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=812478194 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
test_module.test_forward_reshape ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=11 to reproduce.
[13:54:40] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 81920 bytes with malloc directly
[13:54:40] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 576000 bytes with malloc directly
/work/mxnet/python/mxnet/module/base_module.py:66: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
  warnings.warn(msg)

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f7fed68e8fa]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x309619f) [0x7f7ff029b19f]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f801aa774b0]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::NDArray::GetMKLDNNData() const+0x637) [0x7f7fefde2a57]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::NDArray::GetMKLDNNDataReorder(mkldnn::memory::primitive_desc const&) const+0x33c) [0x7f7fefde512c]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x26e0) [0x7f7fed68b150]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x28da1ce) [0x7f7fefadf1ce]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x29eaed7) [0x7f7fefbefed7]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x29eafc1) [0x7f7fefbeffc1]
[bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xcb5) [0x7f7ff01b1f65]
ok
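Several frames in the trace carry only a raw offset into libmxnet.so and no symbol. A small parser like the following can pull out the library and offset so they can be looked up later. This is a sketch: the function name is hypothetical, and it assumes each "[bt]" frame has been joined back onto a single line (the mail client wrapped them).

```python
import re

# Matches e.g. "[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f801aa774b0]"
BT_LINE = re.compile(
    r"\[bt\] \((?P<index>\d+)\)\s+(?P<lib>\S+?)\((?P<where>.*)\)"
    r"\s+\[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_frame(line):
    """Split one joined '[bt]' line into index, library, symbol, offset and address."""
    m = BT_LINE.match(line.strip())
    if m is None:
        return None
    where = m.group("where")
    if "+" in where:
        # The offset follows the last '+'; the symbol (if any) precedes it.
        symbol, _, offset_text = where.rpartition("+")
        offset = int(offset_text, 16)
    else:
        symbol, offset = where, None
    return {
        "index": int(m.group("index")),
        "lib": m.group("lib"),
        "symbol": symbol or None,  # None when only a raw offset is present
        "offset": offset,
        "addr": int(m.group("addr"), 16),
    }


if __name__ == "__main__":
    frame = parse_frame(
        "[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x28da1ce) [0x7f7fefadf1ce]")
    print(frame["lib"], hex(frame["offset"]))
```

For the symbol-less frames, the library path and offset are what a tool such as addr2line (with -e for the library, plus -f and -C) needs, provided the library was built with debug information.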


On Thu, May 3, 2018 at 3:57 PM, Pedro Larroy <pe...@gmail.com>
wrote:

> @Chris seems intel inspector requires purchasing right? maybe some of us
> already owns a license and can execute the test that fails intermittently?
>  test_module.py:test_forward_reshape
>
> On Thu, May 3, 2018 at 3:49 PM, Pedro Larroy <pedro.larroy.lists@gmail.com
> > wrote:
>
>> It's very difficult to reproduce, non-deterministic. We were also running
>> without signal handlers in CI so there are no stack traces unfortunately.
>>
>> Care to elaborate why valgrind doesn't work with Python?
>>
>>
>>
>> On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zh...@gmail.com> wrote:
>>
>>> can we build it in CI? segfault doesn't happen infrequently.
>>>
>>> On May 2, 2018 at 11:34 PM, "Chris Olivier" <cj...@gmail.com> wrote:
>>>
>>> > you can try Intel Inspector, which is like an enhanced version of
>>> valgrind
>>> > with a GUI and whatnot.
>>> >
>>> > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zh...@gmail.com> wrote:
>>> >
>>> > > valgrind doesn't work with Python. also, valgrind doesn't support
>>> some
>>> > > CPU instructions used by MXNet (I think some instructions related to
>>> > > random generator).
>>> > >
>>> > >
>>> > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <
>>> bhavinthaker@gmail.com>
>>> > > wrote:
>>> > > > Have you tried running with valgrind to get some clues on the
>>> > root-cause?
>>> > > >
>>> > > > Bhavin Thaker.
>>> > > >
>>> > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zh...@gmail.com>
>>> wrote:
>>> > > >
>>> > > >> It might also be possible that this isn't an MKLDNN bug.
>>> > > >> I just saw a similar memory error without MKLDNN build.
>>> > > >>
>>> > > >>
>>> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
>>> > incubator-mxnet/detail/PR-10783/1/pipeline
>>> > > >>
>>> > > >> Best,
>>> > > >> Da
>>> > > >>
>>> > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dz...@amazon.com>
>>> wrote:
>>> > > >> > There might be a race condition that causes the memory error.
>>> > > >> > It might be caused by this PR:
>>> > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>>> > > >> > This PR removes MKLDNN memory from NDArray.
>>> > > >> > However, I don't know why this causes memory error. If someone
>>> is
>>> > > using
>>> > > >> the memory, it should still hold the memory with shared pointer.
>>> > > >> > But I do see the memory error increase after this PR is merged.
>>> > > >> >
>>> > > >> > Best,
>>> > > >> > Da
>>> > > >> >
>>> > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
>>> pedro.larroy.lists@gmail.com>
>>> > > >> wrote:
>>> > > >> >
>>> > > >> >     I couldn't reproduce locally with:
>>> > > >> >
>>> > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>>> > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>>> > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>>> > > >> >
>>> > > >> >
>>> > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <
>>> > > >> pedro.larroy.lists@gmail.com>
>>> > > >> >     wrote:
>>> > > >> >
>>> > > >> >     > Hi
>>> > > >> >     >
>>> > > >> >     > Seems master is not running  anymore, there's a
>>> segmentation
>>> > > fault
>>> > > >> using
>>> > > >> >     > MKDLNN-CPU
>>> > > >> >     >
>>> > > >> >     >
>>> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
>>> > > >> >     > incubator-mxnet/detail/master/801/pipeline/662
>>> > > >> >     >
>>> > > >> >     >
>>> > > >> >     > I see my PRs failing with a similar error.
>>> > > >> >     >
>>> > > >> >     > Pedro
>>> > > >> >     >
>>> > > >> >
>>> > > >> >
>>> > > >>
>>> > >
>>> >
>>>
>>
>>
>

Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

@Chris, it seems Intel Inspector requires purchasing a license, right? Maybe
one of us already owns a license and can run the test that fails
intermittently:
 test_module.py:test_forward_reshape


Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

It's very difficult to reproduce because the failure is non-deterministic. We
were also running without signal handlers in CI, so unfortunately there are no
stack traces.

Care to elaborate why valgrind doesn't work with Python?
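
One way to get at least a Python-level stack trace when a test process dies
on a fatal signal is the faulthandler module from the Python 3 standard
library. The sketch below is an assumption about how this could be wired in,
not what the CI actually does; the helper name and the log file name are
hypothetical. Note that the CI job named in this thread runs Python 2, where
faulthandler is not part of the standard library, and that the dump shows
only Python frames, not the C++ frames inside libmxnet.so.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal signal."""
    # The file has to stay open for the life of the process: faulthandler
    # writes to its file descriptor from inside the signal handler.
    log_file = open(log_path, "w")
    # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical log file name; call this before the tests import mxnet.
    crash_log = enable_crash_tracebacks("crash_traceback.log")
```

On Python 3 the same handlers can also be switched on without code changes
by setting the PYTHONFAULTHANDLER environment variable or passing
-X faulthandler to the interpreter; in that case the dump goes to stderr.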




Re: segmentation fault in master using mkdlnn

Posted by Da Zheng <zh...@gmail.com>.

Can we build it in CI? The segfault doesn't happen infrequently.


Re: segmentation fault in master using mkdlnn

Posted by Chris Olivier <cj...@gmail.com>.

You can try Intel Inspector, which is like an enhanced version of Valgrind
with a GUI and whatnot.


Re: segmentation fault in master using mkdlnn

Posted by Da Zheng <zh...@gmail.com>.

Valgrind doesn't work with Python. Also, Valgrind doesn't support some
CPU instructions used by MXNet (I think some instructions related to the
random generator).



Re: segmentation fault in master using mkdlnn

Posted by Bhavin Thaker <bh...@gmail.com>.

Have you tried running with Valgrind to get some clues about the root cause?

Bhavin Thaker.


Re: segmentation fault in master using mkdlnn

Posted by Da Zheng <zh...@gmail.com>.

It is also possible that this isn't an MKLDNN bug.
I just saw a similar memory error in a build without MKLDNN.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline

Best,
Da


Re: segmentation fault in master using mkdlnn

Posted by "Zheng, Da" <dz...@amazon.com>.

There might be a race condition that causes the memory error.
It might be caused by this PR:
https://github.com/apache/incubator-mxnet/pull/10706/files
This PR removes MKLDNN memory from NDArray.
However, I don't know why this would cause a memory error. If someone is still using the memory, they should still hold it through a shared pointer.
But I do see the memory errors increase since this PR was merged.

Best,
Da


Re: segmentation fault in master using mkdlnn

Posted by Pedro Larroy <pe...@gmail.com>.

I couldn't reproduce locally with:

ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
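
Since the crash is intermittent, a single local run may simply miss it. A
minimal sketch of a wrapper that repeats a command until it dies on a signal
is shown here; the function names and the default run count are hypothetical
and not part of the MXNet CI tooling. It treats both a negative return code
(the subprocess convention on POSIX) and an exit status above 128 (the shell
convention, for example 139 for SIGSEGV) as death by signal. The second rule
is a heuristic, because a program may also exit with such a status on
purpose.

```python
import signal
import subprocess
import sys


def signal_from_returncode(returncode):
    """Return the signal name if the status indicates death by signal, else None."""
    if returncode < 0:
        signum = -returncode        # subprocess convention on POSIX
    elif returncode > 128:
        signum = returncode - 128   # shell convention, e.g. 139 for SIGSEGV
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None                 # not a valid signal number


def run_until_crash(cmd, max_runs=50):
    """Run cmd up to max_runs times; return (run_number, signal_name) or None."""
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        name = signal_from_returncode(result.returncode)
        if name is not None:
            return attempt, name
    return None


if __name__ == "__main__":
    # The command to repeat is taken from the command line.
    outcome = run_until_crash(sys.argv[1:])
    print("no crash observed" if outcome is None else "run %d died with %s" % outcome)
```

Saved under a name of your choice, the script takes the command to repeat as
its arguments, for example the unit test step quoted above.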

