Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/10/30 22:05:30 UTC

[GitHub] [incubator-mxnet] larroy opened a new issue #16675: MaskRCNN unable to train with master, works with previous revisions

URL: https://github.com/apache/incubator-mxnet/issues/16675
 
 
   ## Description
   
   I can't train Mask R-CNN with the latest revisions of MXNet, following the GluonCV tutorial:
   
   https://gluon-cv.mxnet.io/build/examples_instance/train_mask_rcnn_coco.html
   
   This revision works:
   
    e9e267ef7 - (Sat, 14 Sep 2019 09:33:08 -0700) reminisce    - Fix remaining errors reported by D2L (#16157)
   
   This doesn't:
   
   86ed5f5c0 - (Mon, 28 Oct 2019 01:24:05 -0700) Huang, Gua.. - [NumPy][Operator] NumPy operator `may_share_memory` and `shares_memory` (#16533) (upstream/v1.6.x)
   
   
   I see very low throughput, high CPU usage, and low GPU utilization, or training gets stuck completely.
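   
   When it does get stuck, it is hard to tell whether the stall is in the data loader or in the backend. A minimal sketch of a diagnostic I can drop into the training script (Python standard library only, nothing MXNet-specific) so that a signal dumps every Python thread's stack:
   
   ```
   import faulthandler
   import signal
   
   # When the process looks stuck, `kill -USR1 <pid>` prints the stack of every
   # Python thread to stderr, which helps separate a DataLoader stall from a
   # hang inside the C++ engine (the latter shows the main thread waiting
   # inside an MXNet call).
   faulthandler.register(signal.SIGUSR1, all_threads=True)
   ```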
   
   This can be reproduced both when building from source and with the latest pip builds, so I don't think it's my environment or my build options.
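   
   As a rough probe that avoids the full training script, a minimal forward-pass timing sketch (the model name is the GluonCV model zoo entry used by the tutorial; the input shape is an arbitrary short-side-800 size, and only the comparison between the two revisions is meaningful, not the absolute numbers):
   
   ```
   import time
   import mxnet as mx
   import gluoncv
   
   ctx = mx.gpu(0)
   # Same backbone as the tutorial; downloads pretrained weights on first use.
   net = gluoncv.model_zoo.get_model('mask_rcnn_resnet50_v1b_coco',
                                     pretrained=True, ctx=ctx)
   x = mx.nd.random.uniform(shape=(1, 3, 800, 1216), ctx=ctx)
   
   for i in range(5):
       tic = time.time()
       ids, scores, bboxes, masks = net(x)
       mx.nd.waitall()  # wait for the async engine before reading the clock
       print('pass %d: %.3f s' % (i, time.time() - tic))
   ```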
   
   
   This is my build environment:
   
   ```
   USE_CUDA: "ON" # Build with CUDA support
   USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
   USE_NCCL: "ON" # Use NVidia NCCL with CUDA
   USE_OPENCV: "ON" # Build with OpenCV support
   USE_OPENMP: "PLATFORM" # Build with Openmp support
   USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
   USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
   USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
   USE_LAPACK: "ON" # Build with lapack support
   USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
   USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
   USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
   USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
   USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
   USE_JEMALLOC: "ON" # Build with Jemalloc support
   USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
   USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
   USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
   USE_CPP_PACKAGE: "OFF" # Build C++ Package
   USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
   USE_GPROF: "OFF" # Compile with gprof (profiling) flag
   USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
   USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
   ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
   BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
   INSTALL_EXAMPLES: "OFF" # Install the example source files.
   USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
   USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
   USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
   ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
   CMAKE_BUILD_TYPE: "Release"
   CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
   CMAKE_C_COMPILER_LAUNCHER: "ccache"
   CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
   ```
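   
   Since the problem shows up both with this source build and with the pip wheels, a quick check (sketch only) that the binary actually being imported matches the flags above; `mxnet.runtime.Features` is the runtime feature query, and the handful of names checked here just mirror the feature list in the diagnose output below:
   
   ```
   import mxnet as mx
   from mxnet.runtime import Features
   
   feats = Features()
   print(mx.__version__, mx.__file__)  # confirm which mxnet install is imported
   for name in ('CUDA', 'CUDNN', 'NCCL', 'MKLDNN', 'OPENMP', 'SIGNAL_HANDLER'):
       print(name, feats.is_enabled(name))
   ```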
   
   
   ## Diagnose output
   
   ```
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              96
   On-line CPU(s) list: 0-95
   Thread(s) per core:  2
   Core(s) per socket:  24
   Socket(s):           2
   NUMA node(s):        2
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               85
   Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
   Stepping:            4
   CPU MHz:             3134.070
   BogoMIPS:            5000.00
   Hypervisor vendor:   KVM
   Virtualization type: full
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            1024K
   L3 cache:            33792K
   NUMA node0 CPU(s):   0-23,48-71
   NUMA node1 CPU(s):   24-47,72-95
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
   ----------Python Info----------
   Version      : 3.6.8
   Compiler     : GCC 8.3.0
   Build        : ('default', 'Oct  7 2019 12:59:55')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 19.3.1
   Directory    : /home/piotr/mxnet/py3_venv/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.6.0
   Directory    : /home/piotr/mxnet/python/mxnet
   Commit hash file "/home/piotr/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
   Library      : ['/home/piotr/mxnet/python/mxnet/../../build/libmxnet.so']
   Build features:
   ✔ CUDA
   ✔ CUDNN
   ✔ NCCL
   ✔ CUDA_RTC
   ✖ TENSORRT
   ✔ CPU_SSE
   ✔ CPU_SSE2
   ✔ CPU_SSE3
   ✔ CPU_SSE4_1
   ✔ CPU_SSE4_2
   ✖ CPU_SSE4A
   ✔ CPU_AVX
   ✖ CPU_AVX2
   ✔ OPENMP
   ✖ SSE
   ✔ F16C
   ✔ JEMALLOC
   ✔ BLAS_OPEN
   ✖ BLAS_ATLAS
   ✖ BLAS_MKL
   ✖ BLAS_APPLE
   ✔ LAPACK
   ✖ MKLDNN
   ✔ OPENCV
   ✖ CAFFE
   ✖ PROFILER
   ✖ DIST_KVSTORE
   ✖ CXX14
   ✖ INT64_TENSOR_SIZE
   ✔ SIGNAL_HANDLER
   ✖ DEBUG
   ✖ TVM_OP
   ----------System Info----------
   Platform     : Linux-4.15.0-1052-aws-x86_64-with-Ubuntu-18.04-bionic
   system       : Linux
   node         : 18-232-106-45
   release      : 4.15.0-1052-aws
   version      : #54-Ubuntu SMP Tue Oct 1 15:43:26 UTC 2019
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.4104 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0190 sec, LOAD: 0.0444 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0222 sec, LOAD: 0.3929 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0184 sec, LOAD: 0.3812 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0017 sec, LOAD: 0.0803 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0063 sec, LOAD: 0.0893 sec.
   ----------Environment----------
   ```
