Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/08/27 01:08:16 UTC

[GitHub] [incubator-mxnet] karan6181 opened a new issue #19024: SSD Training fails with free pointer issue during end of training

karan6181 opened a new issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024


   - The SSD training [script](https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py) fails at the end of training with either `free(): invalid pointer` or `corrupted size vs. prev_size`.
   - I ran the script with and without Horovod on a p3dn instance. Details are below:
   
   ## 1. Without Horovod: 
   
   #### Cmd:
   ```
   python gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3,4,5,6,7 -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20
   ```
   
   #### Failure:
   ```
   free(): invalid pointer
   ```
   
   #### Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd-log
   
   ## 2. With Horovod: 
   
   #### Cmd:
   ```
   horovodrun -np 8 python gluon-cv/scripts/detection/ssd/train_ssd.py -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --horovod --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20
   ```
   
   #### Failure:
   ```
   [1,1]<stderr>:corrupted size vs. prev_size
   [1,1]<stderr>:[ip-100-64-13-241:09515] *** Process received signal ***
   [1,1]<stderr>:[ip-100-64-13-241:09515] Signal: Aborted (6)
   [1,1]<stderr>:[ip-100-64-13-241:09515] Signal code:  (-6)
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7fb2d87948a0]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [1,1]<stderr>:[ 1] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fb2d83cff47]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 2] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fb2d83d18b1]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 3] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x89907)[0x7fb2d841a907]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 4] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x9097a)[0x7fb2d842197a]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 5] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x90b7c)[0x7fb2d8421b7c]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 6] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x94848)[0x7fb2d8425848]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 7] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7fb2d842835d]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [1,1]<stderr>:[ 8] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/bin/../lib/libstdc++.so.6(_Znwm+0x15)[0x7fb269b344e5]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [ 9] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [10] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38ba8c6)[0x7fb28d8e38c6]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [11] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bac16)[0x7fb28d8e3c16]
   [1,1]<stderr>:[ip-100-64-13-241:09515] [12] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bfe60)[0x7fb28d8e8e60]
   ```
   #### Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd_horovod_single_node-log
   
   #### GluonCV: 0.8.0 (built from source)
   
   #### Horovod:
   ```
   Horovod v0.19.5:
   
   Available Frameworks:
       [ ] TensorFlow
       [ ] PyTorch
       [X] MXNet
   
   Available Controllers:
       [X] MPI
       [X] Gloo
   
   Available Tensor Operations:
       [X] NCCL
       [ ] DDL
       [ ] CCL
       [X] MPI
       [X] Gloo
   ```
   
   ## MXNet Diagnosis:
   ```
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              96
   On-line CPU(s) list: 0-95
   Thread(s) per core:  2
   Core(s) per socket:  24
   Socket(s):           2
   NUMA node(s):        2
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               85
   Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
   Stepping:            4
   CPU MHz:             1200.041
   BogoMIPS:            4999.99
   Hypervisor vendor:   KVM
   Virtualization type: full
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            1024K
   L3 cache:            33792K
   NUMA node0 CPU(s):   0-23,48-71
   NUMA node1 CPU(s):   24-47,72-95
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
   ----------Python Info----------
   Version      : 3.6.10
   Compiler     : GCC 7.3.0
   Build        : ('default', 'Mar 25 2020 23:51:54')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 20.0.2
   Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.6.0
   Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
   Commit Hash   : 6de57440b792dca716f1214a81edf557c345fddb
   Library      : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
   Build features:
   ✔ CUDA
   ✔ CUDNN
   ✔ NCCL
   ✔ CUDA_RTC
   ✖ TENSORRT
   ✔ CPU_SSE
   ✔ CPU_SSE2
   ✔ CPU_SSE3
   ✔ CPU_SSE4_1
   ✔ CPU_SSE4_2
   ✖ CPU_SSE4A
   ✔ CPU_AVX
   ✖ CPU_AVX2
   ✔ OPENMP
   ✖ SSE
   ✔ F16C
   ✖ JEMALLOC
   ✔ BLAS_OPEN
   ✖ BLAS_ATLAS
   ✖ BLAS_MKL
   ✖ BLAS_APPLE
   ✔ LAPACK
   ✔ MKLDNN
   ✔ OPENCV
   ✖ CAFFE
   ✖ PROFILER
   ✔ DIST_KVSTORE
   ✖ CXX14
   ✖ INT64_TENSOR_SIZE
   ✔ SIGNAL_HANDLER
   ✖ DEBUG
   ✖ TVM_OP
   ----------System Info----------
   Platform     : Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
   system       : Linux
   node         : ip-100-64-13-241
   release      : 5.3.0-1032-aws
   version      : #34~18.04.2-Ubuntu SMP Fri Jul 24 10:06:28 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 0.3844 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0012 sec, LOAD: 0.0220 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0005 sec, LOAD: 0.0184 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 0.1442 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0035 sec, LOAD: 0.0546 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0004246234893798828 sec.
   ----------Environment----------
   KMP_DUPLICATE_LIB_OK="True"
   KMP_INIT_AT_FORK="FALSE"
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] karan6181 commented on issue #19024: SSD Training fails with free pointer issue during end of training

Posted by GitBox <gi...@apache.org>.
karan6181 commented on issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024#issuecomment-682139049


   Thanks @leezu. I will try that patch and let you know.





[GitHub] [incubator-mxnet] karan6181 commented on issue #19024: SSD Training fails with free pointer issue during end of training

Posted by GitBox <gi...@apache.org>.
karan6181 commented on issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024#issuecomment-681203855


   Initially, I opened an issue in GluonCV (https://github.com/dmlc/gluon-cv/issues/1415), thinking this might be a problem with the script. While root-causing it, I found that adding `mx.nd.waitall()` at the end of the script makes the crash go away. My understanding (correct me if I am wrong) is that one shouldn't need to call `mx.nd.waitall()` explicitly, and that the MXNet engine should release tensors on its own once the operations have finished.
   
   Is this a bug in MXNet, or am I missing something here?
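
For readers who want to try the same workaround, here is a minimal sketch of what "adding `mx.nd.waitall()` at the end of the script" could look like. Only `mx.nd.waitall()` comes from this thread; the `train` and `run` names and the injectable `sync` parameter are hypothetical, added for illustration and so the wrapper can be exercised without MXNet installed. This reflects the workaround reported above, not a confirmed fix for the underlying crash.

```python
def train():
    """Hypothetical stand-in for the body of the training script."""
    return "done"


def run(train_fn=train, sync=None):
    """Run training, then block until pending asynchronous work has finished."""
    if sync is None:
        # Imported lazily so the wrapper can be tested without MXNet installed.
        import mxnet as mx
        sync = mx.nd.waitall  # blocks until all queued engine operations complete
    try:
        return train_fn()
    finally:
        # Workaround from this thread: synchronize before the interpreter exits,
        # even if training raised an exception.
        sync()


# In a real script, the last statement would simply be:
#     run()
```

The `finally` clause is a design choice for this sketch: it makes the synchronization happen on both normal and exceptional exit paths, which a bare call on the last line of the script would not.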





[GitHub] [incubator-mxnet] leezu commented on issue #19024: SSD Training fails with free pointer issue during end of training

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024#issuecomment-681560441


   It's probably fixed by https://github.com/apache/incubator-mxnet/pull/18768. You can apply that commit to the 1.6 branch and check whether the issue persists.





[GitHub] [incubator-mxnet] szha commented on issue #19024: SSD Training fails with free pointer issue during end of training

Posted by GitBox <gi...@apache.org>.
szha commented on issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024#issuecomment-694614126


   @karan6181 any update?





---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] austinmw commented on issue #19024: SSD Training fails with free pointer issue during end of training

Posted by GitBox <gi...@apache.org>.
austinmw commented on issue #19024:
URL: https://github.com/apache/incubator-mxnet/issues/19024#issuecomment-695077855


   Hi, I'm getting `corrupted size vs. prev_size` with:
   `horovodrun -np 4 -H localhost:4 python train_faster_rcnn.py --dataset coco --horovod --disable-hybridization --batch-size 4`




