You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/10/11 17:51:13 UTC
[GitHub] Vikas89 opened a new issue #12800: getting segfault while running
train_cifar10.py program in example directory
Vikas89 opened a new issue #12800: getting segfault while running train_cifar10.py program in example directory
URL: https://github.com/apache/incubator-mxnet/issues/12800
I am trying to run this command:
python example/image-classification/train_cifar10.py
And getting the segfault. It is very consistent.
```
INFO:root:Epoch[0] Batch [20] Speed: 18.95 samples/sec accuracy=0.164807
INFO:root:Epoch[0] Batch [40] Speed: 20.34 samples/sec accuracy=0.246875
INFO:root:Epoch[0] Batch [60] Speed: 21.66 samples/sec accuracy=0.308984
INFO:root:Epoch[0] Batch [80] Speed: 22.15 samples/sec accuracy=0.328125
INFO:root:Epoch[0] Batch [100] Speed: 22.38 samples/sec accuracy=0.362500
INFO:root:Epoch[0] Batch [120] Speed: 21.52 samples/sec accuracy=0.378125
INFO:root:Epoch[0] Batch [140] Speed: 22.96 samples/sec accuracy=0.417969
INFO:root:Epoch[0] Batch [160] Speed: 20.04 samples/sec accuracy=0.426563
INFO:root:Epoch[0] Batch [180] Speed: 18.58 samples/sec accuracy=0.430078
INFO:root:Epoch[0] Batch [200] Speed: 21.39 samples/sec accuracy=0.443750
INFO:root:Epoch[0] Batch [220] Speed: 17.69 samples/sec accuracy=0.469531
INFO:root:Epoch[0] Batch [240] Speed: 18.59 samples/sec accuracy=0.469141
INFO:root:Epoch[0] Batch [260] Speed: 21.68 samples/sec accuracy=0.470313
INFO:root:Epoch[0] Batch [280] Speed: 21.05 samples/sec accuracy=0.487891
INFO:root:Epoch[0] Batch [300] Speed: 22.30 samples/sec accuracy=0.503125
INFO:root:Epoch[0] Batch [320] Speed: 20.98 samples/sec accuracy=0.534766
INFO:root:Epoch[0] Batch [340] Speed: 15.53 samples/sec accuracy=0.526172
INFO:root:Epoch[0] Batch [360] Speed: 13.88 samples/sec accuracy=0.528516
Traceback (most recent call last):
File "example/image-classification/train_cifar10.py", line 79, in <module>
fit.fit(args, sym, data.get_rec_iter)
File "/Users/vikumar/incubator-mxnet/example/image-classification/common/fit.py", line 333, in fit
monitor=monitor)
File "/Users/vikumar/incubator-mxnet/python/mxnet/module/base_module.py", line 563, in fit
next_data_batch = next(data_iter)
File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
return self.next()
File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 840, in next
check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
File "/Users/vikumar/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:58:31] src/io/recordio_split.cc:29: Check failed: (reinterpret_cast<size_t>(end) & 3UL) == 0U (1 vs. 0)
Stack trace returned 10 entries:
[bt] (0) 0 libmxnet.so 0x000000010ae2cf30 dmlc::StackTrace() + 272
[bt] (1) 1 libmxnet.so 0x000000010ae2ccdf dmlc::LogMessageFatal::~LogMessageFatal() + 47
[bt] (2) 2 libmxnet.so 0x000000010c68a01c dmlc::io::RecordIOSplitter::FindLastRecordBegin(char const*, char const*) + 444
[bt] (3) 3 libmxnet.so 0x000000010c68d10c dmlc::io::InputSplitBase::ReadChunk(void*, unsigned long*) + 284
[bt] (4) 4 libmxnet.so 0x000000010c68d230 dmlc::io::InputSplitBase::Chunk::Load(dmlc::io::InputSplitBase*, unsigned long) + 144
[bt] (5) 5 libmxnet.so 0x000000010c6a0b8f dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()::operator()() const + 895
[bt] (6) 6 libmxnet.so 0x000000010c6a071d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()> >(void*) + 45
[bt] (7) 7 libsystem_pthread.dylib 0x00007fff903c293b _pthread_body + 180
[bt] (8) 8 libsystem_pthread.dylib 0x00007fff903c2887 _pthread_body + 0
[bt] (9) 9 libsystem_pthread.dylib 0x00007fff903c208d thread_start + 13
```
## Environment info
build command - make -j8 USE_DIST_KVSTORE=1
os: mac
```
What to do:
1. o/p of diagnose script:
88e9fe53272d:incubator-mxnet vikumar$ python /tmp/lk.py
----------Python Info----------
Version : 3.6.5
Compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build : ('default', 'Apr 26 2018 08:42:37')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 10.0.1
Directory : /Users/vikumar/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/vikumar/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
objc[59758]: Class CaptureDelegate is implemented in both /usr/local/opt/opencv/lib/libopencv_videoio.3.4.dylib (0x10ada5938) and /Users/vikumar/anaconda3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x1a17065ce0). One of the two will be used. Which one is undefined.
Version : 1.3.0
Directory : /Users/vikumar/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Darwin-16.7.0-x86_64-i386-64bit
system : Darwin
node : 88e9fe53272d.ant.amazon.com
release : 16.7.0
version : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
----------Hardware Info----------
machine : x86_64
processor : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0115 sec, LOAD: 0.5897 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0183 sec, LOAD: 0.2578 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0227 sec, LOAD: 0.1714 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0186 sec, LOAD: 0.2626 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.3084 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0148 sec, LOAD: 0.0534 sec.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services