
Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/12/01 23:49:26 UTC

[GitHub] [incubator-mxnet] mseth10 opened a new issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

mseth10 opened a new issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766


   ## Description
   When building MXNet for AArch64 with MKL-DNN and ACL enabled, the build succeeds, but the resulting binary fails some tests, for example `test_deconv` in `test_gluon.py`. Here's the pipeline running the build:
   https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fsethman-test-cd-release-job/detail/sethman-test-cd-release-job/131/pipeline/82
   
   ### Error Message
   ```
   test_gluon.test_deconv ... python3: ../3rdparty/mkldnn/src/common/primitive.hpp:220: const T* dnnl::impl::resource_mapper_t::get(dnnl::impl::resource_mapper_t::key_t*) const [with T = dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t; dnnl::impl::resource_mapper_t::key_t = const dnnl::impl::primitive_t]: Assertion `primitive_to_resource_.count(p)' failed.
   ```
   Here's the backtrace:
   ```
   #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
   #1  0x0000fffff7e7ed54 in __GI_abort () at abort.c:79
   #2  0x0000fffff7e8b61c in __assert_fail_base (fmt=0xfffff7f84c48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
       assertion=assertion@entry=0xffff55f88aa8 "primitive_to_resource_.count(p)", file=file@entry=0xffff55f88a78 "../3rdparty/mkldnn/src/common/primitive.hpp", 
       line=line@entry=220, 
       function=function@entry=0xffff55f88990 "const T* dnnl::impl::resource_mapper_t::get(dnnl::impl::resource_mapper_t::key_t*) const [with T = dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t; dnnl::impl::resource_mapper_t::key_t = const "...) at assert.c:92
   #3  0x0000fffff7e8b684 in __GI___assert_fail (assertion=0xffff55f88aa8 "primitive_to_resource_.count(p)", 
       file=0xffff55f88a78 "../3rdparty/mkldnn/src/common/primitive.hpp", line=220, 
       function=0xffff55f88990 "const T* dnnl::impl::resource_mapper_t::get(dnnl::impl::resource_mapper_t::key_t*) const [with T = dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t; dnnl::impl::resource_mapper_t::key_t = const "...) at assert.c:101
   #4  0x0000ffff55459e10 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #5  0x0000ffff554599a4 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #6  0x0000ffff552f9850 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #7  0x0000ffff55423734 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #8  0x0000ffff5473a884 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #9  0x0000ffff5475e5f4 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #10 0x0000ffff54739ef0 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #11 0x0000ffff5473a128 in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #12 0x0000ffff4b8c0414 in dnnl::primitive::execute(dnnl::stream const&, std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > > const&) const () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #13 0x0000ffff4b8c05d4 in mxnet::MKLDNNStream::Submit(bool) () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #14 0x0000ffff4ca53908 in mxnet::op::MKLDNNDeconvBwd::Execute(unsigned int, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::op::MKLDNNDeconvBwd::ReadTensors const&, mxnet::op::MKLDNNDeconvBwd::WriteTensors const&) const () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #15 0x0000ffff4ca52fa0 in mxnet::op::MKLDNNDeconvolutionBackward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
      from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #16 0x0000ffff4bbdcfec in void std::__invoke_impl<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&>(std::__invoke_other, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&
 , std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
      from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #17 0x0000ffff4bbd8dac in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&> >::value, void>::type std::__invoke_r<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::ve
 ctor<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&>(void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #18 0x0000ffff4bbd38c4 in std::_Function_handler<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), void (*)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
      from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #19 0x0000ffff4b988848 in std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) const () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #20 0x0000ffff4ca2c784 in mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #21 0x0000ffff4c6205ec in ?? () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #22 0x0000ffff4bbdcfec in void std::__invoke_impl<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&>(std::__invoke_other, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&
 , std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   #23 0x0000ffff4bbd8dac in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&> >::value, void>::type std::__invoke_r<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::ve
 ctor<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&>(void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /home/ubuntu/.local/lib/python3.8/site-packages/mxnet/libmxnet.so
   ```
   
   ## To Reproduce
   
   ### Steps to reproduce
   Working on a Docker image for ease of reproduction.
   
   ## What have you tried to solve it?
   
   1. Tried building with different OneDNN and ACL versions, but it did not help
   
   ## Environment
   
   ***We recommend using our script for collecting the diagnostic information with the following command***
   `curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`
   
   <details>
   <summary>Environment Information</summary>
   
   ```
   ----------Python Info----------
   Version      : 3.6.8
   Compiler     : GCC 4.8.5 20150623 (Red Hat 4.8.5-44)
   Build        : ('default', 'Nov 16 2020 16:33:14')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 9.0.3
   Directory    : /usr/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   No MXNet installed.
   ----------System Info----------
   Platform     : Linux-5.11.0-1022-aws-aarch64-with-centos-7.9.2009-AltArch
   system       : Linux
   node         : c18cd793fd55
   release      : 5.11.0-1022-aws
   version      : #23~20.04.1-Ubuntu SMP Mon Nov 15 14:04:48 UTC 2021
   ----------Hardware Info----------
   machine      : aarch64
   processor    : aarch64
   Architecture:          aarch64
   Byte Order:            Little Endian
   CPU(s):                64
   On-line CPU(s) list:   0-63
   Thread(s) per core:    1
   Core(s) per socket:    64
   Socket(s):             1
   NUMA node(s):          1
   Model:                 1
   BogoMIPS:              243.75
   L1d cache:             64K
   L1i cache:             64K
   L2 cache:              1024K
   L3 cache:              32768K
   NUMA node0 CPU(s):     0-63
   Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 0.4363 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1806 sec, LOAD: 0.1674 sec.
   Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:877)>, DNS finished in 0.2053844928741455 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0147 sec, LOAD: 0.0943 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0374 sec, LOAD: 5.3653 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.018440961837768555 sec.
   ----------Environment----------
   ```
   
   </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org

[GitHub] [incubator-mxnet] nSircombe edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

nSircombe edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-1009700537







[GitHub] [incubator-mxnet] nSircombe commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

nSircombe commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-984390070


   Hi @mseth10,
   
   From a first look, I notice the failure appears to be with `mxnet::op::MKLDNNDeconvolutionBackward`.
   Compute Library (and therefore our framework builds with oneDNN + ACL) is intended for inference workloads, and so does not support `BWD_D` `prop_kind` and deconvolution. We have seen failures in deconvolution in oneDNN testing - I believe it _should_ use oneDNN's C++ ref. kernel in the case of deconvolution though, so this is a little puzzling...
   
   You appear to be building against Compute Library 21.05, but which version of oneDNN is included in the MXNet 1.x build?
   Have you tried picking up the latest Compute Library release (21.11 is just out)?
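
To make the expected fallback concrete, here is a toy Python model of first-match implementation selection. It is a sketch under assumptions, not oneDNN code: the names `Impl`, `select_impl`, and `CANDIDATES` are hypothetical. The only points taken from this thread are that the ACL path targets inference (forward) workloads and that the reference kernel is expected to handle everything else.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Impl:
    """One candidate implementation (hypothetical model, not a oneDNN type)."""
    name: str
    supports: Callable[[str, str], bool]  # (primitive_kind, prop_kind) -> bool


def select_impl(impls: List[Impl], primitive_kind: str, prop_kind: str) -> Impl:
    """Return the first candidate that accepts the request."""
    for impl in impls:
        if impl.supports(primitive_kind, prop_kind):
            return impl
    raise LookupError(f"no implementation for {primitive_kind}/{prop_kind}")


# Assumption: the ACL-backed candidate only accepts forward convolution,
# while the reference candidate accepts every request.
CANDIDATES = [
    Impl("acl", lambda kind, prop: kind == "convolution" and prop.startswith("forward")),
    Impl("ref", lambda kind, prop: True),
]

print(select_impl(CANDIDATES, "convolution", "forward_inference").name)
print(select_impl(CANDIDATES, "deconvolution", "backward_data").name)
```

In this model a backward-data deconvolution never reaches the ACL candidate, which is why a failure inside an ACL primitive during `MKLDNNDeconvolutionBackward` is surprising.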



[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-992208238


   Hi @cfRod @nSircombe, does the above backtrace help answer your questions? Please let me know what else you would like me to try. Thanks!



[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-988320617


   Hi @nSircombe, thanks for the suggestion. MXNet v1.x currently points to OneDNN v2.3.2. I can try the latest ACL v21.11; should I also upgrade the OneDNN version along with it?
   
   ACL not supporting training-specific operators makes sense, but it should ideally fall back to OneDNN's reference kernels. It looks like the issue is in the OneDNN + ACL integration. Is this issue fixed in later releases? Were the deconvolution failures fixed for you when you upgraded the OneDNN/ACL versions?



[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-1009400604


   Hi @cfRod @nSircombe, continuing from where we left off before the holidays, do the above logs help you understand where the issue is coming from? Do you want me to fetch any additional logs?



[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-989466011


   Thanks @cfRod for the suggestion. Here's a more extensive backtrace:
   ```
   #0  0x0000ffffbdce5238 in raise () from /lib64/libc.so.6
   #1  0x0000ffffbdce68b0 in abort () from /lib64/libc.so.6
   #2  0x0000ffffbdcde72c in __assert_fail_base () from /lib64/libc.so.6
   #3  0x0000ffffbdcde7e4 in __assert_fail () from /lib64/libc.so.6
   #4  0x0000ffffb8f37708 in dnnl::impl::resource_mapper_t::get<dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t> (this=0x45520e8, p=0x47c8f20)
       at ../3rdparty/mkldnn/src/common/primitive.hpp:220
   #5  0x0000ffffb8f3729c in dnnl::impl::cpu::aarch64::acl_indirect_gemm_convolution_fwd_t::execute_forward (this=0x47c8f20, ctx=...)
       at ../3rdparty/mkldnn/src/cpu/aarch64/acl_indirect_gemm_convolution.cpp:36
   #6  0x0000ffffb8dd7148 in dnnl::impl::cpu::aarch64::acl_indirect_gemm_convolution_fwd_t::execute (this=0x47c8f20, ctx=...)
       at ../3rdparty/mkldnn/src/cpu/aarch64/acl_indirect_gemm_convolution.hpp:167
   #7  0x0000ffffb8f0102c in dnnl::impl::cpu::ref_deconvolution_bwd_data_t::execute (this=0x4e3cff0, ctx=...) at ../3rdparty/mkldnn/src/cpu/ref_deconvolution.cpp:566
   #8  0x0000ffffb821817c in dnnl_primitive::execute (this=0x45520c0, ctx=...) at ../3rdparty/mkldnn/src/common/primitive.cpp:264
   #9  0x0000ffffb823beec in dnnl_stream::enqueue_primitive (this=0x4aa8580, primitive_iface=0x45520c0, ctx=...) at ../3rdparty/mkldnn/src/common/stream.cpp:33
   #10 0x0000ffffb82177e8 in dnnl::impl::primitive_execute (primitive_iface=0x45520c0, ctx=...) at ../3rdparty/mkldnn/src/common/primitive.cpp:127
   #11 0x0000ffffb8217a20 in dnnl_primitive_execute (primitive_iface=0x45520c0, stream=0x4aa8580, nargs=3, c_args=0x4f8e340) at ../3rdparty/mkldnn/src/common/primitive.cpp:169
   Python Exception <class 'gdb.error'> There is no member or method named _M_bbegin.: 
   #12 0x0000ffffaf37cfc4 in dnnl::primitive::execute (this=0x4e3d0d0, astream=..., args=std::unordered_map with 3 elements) at ../3rdparty/mkldnn/include/oneapi/dnnl/dnnl.hpp:11044
   #13 0x0000ffffaf37d184 in mxnet::MKLDNNStream::Submit (this=0x4136998, cleanup=true) at ../src/common/./../operator/nn/mkldnn/mkldnn_base-inl.h:464
   #14 0x0000ffffb05104b8 in mxnet::op::MKLDNNDeconvBwd::Execute (this=0x4282710, num_group=1, req=std::vector of length 3, capacity 3 = {...}, read_tensors=..., write_tensors=...)
       at ../src/operator/nn/mkldnn/mkldnn_deconvolution.cc:267
   #15 0x0000ffffb050fb50 in mxnet::op::MKLDNNDeconvolutionBackward (attrs=..., ctx=..., inputs=std::vector of length 4, capacity 4 = {...}, 
       req=std::vector of length 3, capacity 3 = {...}, outputs=std::vector of length 3, capacity 3 = {...}) at ../src/operator/nn/mkldnn/mkldnn_deconvolution.cc:183
   #16 0x0000ffffaf699b9c in std::__invoke_impl<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&> (
       __f=@0xfffff3a1f618: 0xffffb050f9bc <mxnet::op::MKLDNNDeconvolutionBackward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>)
       at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:60
   #17 0x0000ffffaf69595c in std::__invoke_r<void, void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&> (
       __fn=@0xfffff3a1f618: 0xffffb050f9bc <mxnet::op::MKLDNNDeconvolutionBackward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>)
       at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:153
   #18 0x0000ffffaf690474 in std::_Function_handler<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&), void (*)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) (__functor=..., __args#0=..., 
       __args#1=..., __args#2=std::vector of length 4, capacity 4 = {...}, __args#3=std::vector of length 3, capacity 3 = {...}, __args#4=std::vector of length 3, capacity 3 = {...})
       at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_function.h:291
   #19 0x0000ffffaf4453f8 in std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) const (this=0xfffff3a1f618, __args#0=..., __args#1=..., __args#2=std::vector of length 4, capacity 4 = {...}, 
       __args#3=std::vector of length 3, capacity 3 = {...}, __args#4=std::vector of length 3, capacity 3 = {...})
       at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_function.h:622
   #20 0x0000ffffb04e9334 in mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) (fn=..., attrs=..., ctx=..., inputs=std::vector of length 4, capacity 4 = {...}, req=std::vector of length 3, capacity 3 = {...}, 
       outputs=std::vector of length 3, capacity 3 = {...}) at ../src/operator/nn/mkldnn/mkldnn_base.cc:672
   #21 0x0000ffffb00dd19c in mxnet::op::DeconvolutionGradComputeExCPU (attrs=..., ctx=..., inputs=std::vector of length 4, capacity 4 = {...}, 
       req=std::vector of length 3, capacity 3 = {...}, outputs=std::vector of length 3, capacity 3 = {...}) at ../src/operator/nn/deconvolution.cc:71
   ```
   
   It looks like oneDNN's reference deconvolution operator is called first: `dnnl::impl::cpu::ref_deconvolution_bwd_data_t::execute (this=0x4e3cff0, ctx=...) at ../3rdparty/mkldnn/src/cpu/ref_deconvolution.cpp:566`
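
The call chain is easier to read with the frames reduced to function names and source locations. The snippet below is a small sketch that does this for frames #4 to #8, copied from the backtrace above with each frame joined onto one line. The regular expression and the helper name `frame_functions` are illustrative and only handle this simple `#N addr in func (args) at file:line` shape.

```python
import re

# Frames #4-#8 from the backtrace above, each joined onto a single line.
BACKTRACE = """\
#4  0x0000ffffb8f37708 in dnnl::impl::resource_mapper_t::get<dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t> (this=0x45520e8, p=0x47c8f20) at ../3rdparty/mkldnn/src/common/primitive.hpp:220
#5  0x0000ffffb8f3729c in dnnl::impl::cpu::aarch64::acl_indirect_gemm_convolution_fwd_t::execute_forward (this=0x47c8f20, ctx=...) at ../3rdparty/mkldnn/src/cpu/aarch64/acl_indirect_gemm_convolution.cpp:36
#6  0x0000ffffb8dd7148 in dnnl::impl::cpu::aarch64::acl_indirect_gemm_convolution_fwd_t::execute (this=0x47c8f20, ctx=...) at ../3rdparty/mkldnn/src/cpu/aarch64/acl_indirect_gemm_convolution.hpp:167
#7  0x0000ffffb8f0102c in dnnl::impl::cpu::ref_deconvolution_bwd_data_t::execute (this=0x4e3cff0, ctx=...) at ../3rdparty/mkldnn/src/cpu/ref_deconvolution.cpp:566
#8  0x0000ffffb821817c in dnnl_primitive::execute (this=0x45520c0, ctx=...) at ../3rdparty/mkldnn/src/common/primitive.cpp:264
"""

# frame number, address (ignored), function, argument list (ignored), file, line
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(.*\) at (\S+):(\d+)$")


def frame_functions(text):
    """Return (frame_number, function, file, line) for each parsable frame."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            number, func, path, line = match.groups()
            frames.append((int(number), func, path, int(line)))
    return frames


# Print the outermost caller first so the nesting reads top-down.
for number, func, path, line in reversed(frame_functions(BACKTRACE)):
    print(f"#{number} {func} ({path.rsplit('/', 1)[-1]}:{line})")
```

Read top-down, the reference backward-data deconvolution (frame #7) calls the ACL forward convolution (frame #6), which then fails the resource lookup in frame #4.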



[GitHub] [incubator-mxnet] cfRod edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

cfRod edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-992404951

Hi @mseth10 ,

- Could you confirm whether the convolution-only tests pass and whether they use ACL? Please set `DNNL_VERBOSE=1` and paste the resulting logs here for reference.
- For the deconvolution tests, it seems the reference deconvolution is being called, but the ACL implementation is also being called during the forward pass. I had a look at `test_deconv` https://github.com/apache/incubator-mxnet/blob/40359ceda150ca75da6e45b1ea35d747ef53deac/tests/python/unittest/test_gluon.py#L451 and it looks like it is calling `check_convolution_forward()`. As a confirmation, is the following test (`check_deconvolution_forward_backward`) passing? https://github.com/apache/incubator-mxnet/blob/40359ceda150ca75da6e45b1ea35d747ef53deac/tests/python/unittest/test_operator.py#L1086. Is it calling ACL during the forward pass and the oneDNN reference in the backward pass, or is it calling the reference in both forward and backward? If the test is passing, then it might be a peculiarity of the test itself rather than an integration issue. Either way, it might be worth understanding what the Gluon `test_deconv` is actually doing.
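
As a sketch of how such logs could be summarized, the following Python snippet counts which implementation ran for each primitive in `DNNL_VERBOSE=1` output. It assumes the comma-separated layout used by oneDNN 2.x verbose mode (`dnnl_verbose,exec,<engine>,<primitive>,<implementation>,<prop_kind>,...`). The sample lines are hand-written placeholders in that layout, not output from this build, and the implementation names in them are invented for illustration. The helper name `summarize_verbose` is also hypothetical.

```python
from collections import Counter


def summarize_verbose(lines):
    """Count (primitive, implementation, prop_kind) triples in exec lines.

    Assumes the oneDNN 2.x layout:
    dnnl_verbose,exec,<engine>,<primitive>,<implementation>,<prop_kind>,...
    """
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 6 or fields[0] != "dnnl_verbose" or fields[1] != "exec":
            continue  # skip info/create lines and unrelated test output
        counts[(fields[3], fields[4], fields[5])] += 1
    return counts


# Placeholder lines (NOT real output from this issue); trailing fields elided.
SAMPLE = [
    "dnnl_verbose,info,cpu,runtime:OpenMP",
    "dnnl_verbose,exec,cpu,convolution,example_acl_impl,forward_training,placeholder",
    "dnnl_verbose,exec,cpu,deconvolution,example_ref_impl,backward_data,placeholder",
    "test_gluon.test_deconv ...",
]

for (primitive, impl, prop), n in summarize_verbose(SAMPLE).items():
    print(f"{primitive:15s} {impl:18s} {prop:18s} x{n}")
```

With a real run, the captured stdout of the test could be passed to `summarize_verbose` line by line to see whether the forward and backward passes use ACL or the reference implementation.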


[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-984211089


   @cfRod can you please help with this issue?



[GitHub] [incubator-mxnet] nSircombe commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

nSircombe commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-1009700537


   Hi @mseth10,
   
   From the log, it looks like the test is calling ACL for the convolution phase and the ref. kernels for the deconvolution, as it should. So, this test should be ok!
   
   Hmm... there may be an underlying bug in the integration of the ACL primitive. It uses oneDNN's `resource_mapper` to hold the stateful ACL objects, so there may be a problem freeing these resources under these conditions which we've not seen previously.



[GitHub] [incubator-mxnet] mseth10 edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-984211089


   @nSircombe @cfRod can you please take a look at this issue and provide some input? I used the Tool-Solutions repository as a reference, where you have worked on similar builds for other frameworks.



[GitHub] [incubator-mxnet] cfRod commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

cfRod commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-992404951


   Hi @mseth10 ,
   
   - Could you confirm whether the convolution-only tests pass and whether they use ACL? Please set `DNNL_VERBOSE=1` and paste the logs here for reference.
   - For the deconvolution tests, it seems the reference deconvolution is called, but the ACL implementation is also called during the forward pass. I had a look at `test_deconv` https://github.com/apache/incubator-mxnet/blob/40359ceda150ca75da6e45b1ea35d747ef53deac/tests/python/unittest/test_gluon.py#L451 and it looks like it calls `check_convolution_forward()`. As a confirmation, is the following test (`check_deconvolution_forward_backward`) passing? https://github.com/apache/incubator-mxnet/blob/40359ceda150ca75da6e45b1ea35d747ef53deac/tests/python/unittest/test_operator.py#L1086 If the test passes, then the failure might be a peculiarity of the test itself rather than an integration issue. Either way, it might be worth understanding what the gluon `test_deconv` is actually doing.
   



[GitHub] [incubator-mxnet] nSircombe edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

nSircombe edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-1009700537


   Hi @mseth10,
   
   From the log, it looks like the test is calling ACL for the convolution phase and the ref. kernels for the deconvolution, as it should. So, this test should be ok...
   
   There may be an underlying bug in the integration of the ACL primitive. It uses oneDNN's `resource_mapper` to hold the stateful ACL objects, so there may be a problem managing these resources under these conditions which we've not seen previously. It appears to be attempting to access an ACL object during the deconvolution phase, even though it's using the ref. kernel in this case.



[GitHub] [incubator-mxnet] nSircombe edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

nSircombe edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-1009700537

Hi @mseth10,

From the log, it looks like the test is calling ACL for the convolution phase and the ref. kernels for the deconvolution, as it should. So, this test should be ok...

The integration of the ACL primitive uses oneDNN's `resource_mapper` to hold the stateful ACL objects, so there may be a problem managing these resources under these conditions which we've not seen previously. It appears to be attempting to access an ACL object following the deconvolution phase; it may be expecting to have access to the ACL object from the convolution pass, which has been freed in the meantime.

Do you have a log from an x86 run, or from a run using a oneDNN build without the ACL backend (i.e. ref. kernels in all cases; this just means building oneDNN without setting `-DDNNL_AARCH64_USE_ACL=ON`)? It may help to see what the oneDNN log _should_ look like.
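
The suspected failure mode can be pictured with a toy model. This is only a sketch of the hypothesis, not oneDNN's actual code: `ResourceMapper`, `Primitive`, and `lookup_fails` are hypothetical names, and the only detail taken from the log is that the failed assertion checks whether the primitive has an entry in `primitive_to_resource_`.

```python
class Primitive:
    """Hypothetical stand-in for a primitive; instances act as lookup keys."""

    def __init__(self, name):
        self.name = name


class ResourceMapper:
    """Toy per-primitive resource table (an illustration, not oneDNN code)."""

    def __init__(self):
        self._primitive_to_resource = {}

    def add(self, primitive, resource):
        self._primitive_to_resource[primitive] = resource

    def get(self, primitive):
        # Mirrors the check that fails in the log: the primitive must
        # already have an entry in the table.
        if primitive not in self._primitive_to_resource:
            raise AssertionError("primitive_to_resource_.count(p)")
        return self._primitive_to_resource[primitive]


def lookup_fails(mapper, primitive):
    """Return True if the mapper holds no resource for this primitive."""
    try:
        mapper.get(primitive)
    except AssertionError:
        return True
    return False


conv = Primitive("acl convolution")

first = ResourceMapper()
first.add(conv, "stateful ACL object")
print(first.get(conv))

# A table that never received (or no longer holds) the convolution's
# resource reproduces the shape of the reported failure.
second = ResourceMapper()
print(lookup_fails(second, conv))
```

If the real cause is along these lines, the interesting question is which mapper instance is passed to the ACL convolution when it executes, and whether its resource was created for that instance.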


[GitHub] [incubator-mxnet] cfRod edited a comment on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

cfRod edited a comment on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-988759914


   >ACL not supporting training specific operators makes sense, but it should ideally default to OneDNN's reference kernels
   
   It should fall back to the reference implementation, as per this implementation list https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/cpu_deconvolution_list.cpp#L48, since there is no deconvolution operator implemented in ACL here.
   
   For convolution, you can see the implementations here https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/cpu_convolution_list.cpp#L105 guarded by `CPU_INSTANCE_AARCH64_ACL`,
   and within the convolution we have checks to make sure FWD phases are enabled https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/aarch64/acl_convolution_utils.cpp#L47
   
   Could you please check whether MXNet implements deconvolution via oneDNN's deconvolution operator?
   
   From the logs, it seems it is calling convolution ops from ACL:
   ```
   ) const [with T = dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t; dnnl::impl::resource_mapper_t::key_t = const "...) at assert.c:92
   ```



[GitHub] [incubator-mxnet] cfRod commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

cfRod commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-988759914


   >ACL not supporting training specific operators makes sense, but it should ideally default to OneDNN's reference kernels
   
   It should fall back to the reference implementation, as per this implementation list https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/cpu_deconvolution_list.cpp#L48, since there is no deconvolution operator implemented in ACL here.
   
   For convolution, you can see the implementations here https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/cpu_convolution_list.cpp#L105 guarded by `CPU_INSTANCE_AARCH64_ACL`,
   and within the convolution we have checks to make sure FWD phases are enabled https://github.com/oneapi-src/oneDNN/blob/b2cd3a8e50a715f9326a35f4c503bd11e60235a5/src/cpu/aarch64/acl_convolution_utils.cpp#L47
   
   Could you please check whether MXNet implements deconvolution via oneDNN's deconvolution operator?



[GitHub] [incubator-mxnet] mseth10 commented on issue #20766: [Test] failure with MXNet AArch64 wheel (built with OneDNN + ACL)

Posted by GitBox <gi...@apache.org>.

mseth10 commented on issue #20766:
URL: https://github.com/apache/incubator-mxnet/issues/20766#issuecomment-993495652


   I set `DNNL_VERBOSE=1` and ran `test_deconvolution`, which runs `check_deconvolution_forward_backward`:
   https://github.com/apache/incubator-mxnet/blob/40359ceda150ca75da6e45b1ea35d747ef53deac/tests/python/unittest/test_operator.py#L1204
   It produced the following logs:
   ```
   + nosetests-3.4 --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --verbose tests/python/unittest/test_operator.py:test_deconvolution
   [INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=2101606398 to reproduce.
   [12:24:14] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   test_operator.test_deconvolution ... dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49)
   dnnl_verbose,info,cpu,runtime:OpenMP
   dnnl_verbose,info,cpu,isa:AArch64 (with Advanced SIMD & floating-point)
   dnnl_verbose,info,gpu,runtime:none
   dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,1.26904
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x3x3,0.406006
   dnnl_verbose,exec,cpu,convolution,indirect_gemm:acl,forward_training,src_f32::blocked:acdb:f0 wei_f32::blocked:acdb:f0 bia_undef::undef::f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic1oc1_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.710938
   [12:24:14] ../src/operator/nn/mkldnn/mkldnn_base.cc:590: test Convolution
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,0.296875
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,0.203857
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x3x3,0.143799
   dnnl_verbose,exec,cpu,deconvolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_undef::undef::f0 dst_f32::blocked:abcd:f0,,alg:deconvolution_direct,mb1_ic1oc1_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.263916
   [12:24:14] ../src/operator/nn/mkldnn/mkldnn_base.cc:590: test Deconvolution
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,26.844
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,26.7649
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x3x3,26.749
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,26.7561
   dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x3x3,26.7588
   python3: ../3rdparty/mkldnn/src/common/primitive.hpp:220: const T* dnnl::impl::resource_mapper_t::get(dnnl::impl::resource_mapper_t::key_t*) const [with T = dnnl::impl::cpu::aarch64::acl_indirect_gemm_resource_t; dnnl::impl::resource_mapper_t::key_t = const dnnl::impl::primitive_t]: Assertion `primitive_to_resource_.count(p)' failed.
   ```
   @cfRod does this show that convolution uses reference kernels but deconvolution calls acl?
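
A quick way to answer this kind of question is to tally which implementation each primitive reported. The sketch below relies on the field order given by the `prim_template` line in the log (`operation,engine,primitive,implementation,...`); the function name `implementations_used` is hypothetical, and the sample lines are copied from the log above.

```python
from collections import Counter


def implementations_used(log_text):
    """Count (primitive, implementation) pairs in dnnl_verbose exec lines."""
    counts = Counter()
    for line in log_text.splitlines():
        # The marker may follow other output on the same line, so search for it.
        start = line.find("dnnl_verbose,exec,")
        if start < 0:
            continue
        fields = line[start:].split(",")
        if len(fields) < 5:
            continue
        # fields: dnnl_verbose, operation, engine, primitive, implementation, ...
        counts[(fields[3], fields[4])] += 1
    return counts


SAMPLE = """\
dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,,1x1x5x5,1.26904
dnnl_verbose,exec,cpu,convolution,indirect_gemm:acl,forward_training,src_f32::blocked:acdb:f0 wei_f32::blocked:acdb:f0 bia_undef::undef::f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic1oc1_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.710938
dnnl_verbose,exec,cpu,deconvolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_undef::undef::f0 dst_f32::blocked:abcd:f0,,alg:deconvolution_direct,mb1_ic1oc1_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.263916
"""

for (primitive, implementation), count in sorted(implementations_used(SAMPLE).items()):
    print(primitive, implementation, count)
```

In the log above, the convolution line reports `indirect_gemm:acl` and the deconvolution line reports `gemm:ref`.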

