Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/10/11 17:51:13 UTC

[GitHub] Vikas89 opened a new issue #12800: getting segfault while running train_cifar10.py program in example directory

URL: https://github.com/apache/incubator-mxnet/issues/12800
 
 
   I am trying to run this command:
   python example/image-classification/train_cifar10.py
   
   It crashes consistently partway through the first epoch with the error below.
   ```
   INFO:root:Epoch[0] Batch [20]	Speed: 18.95 samples/sec	accuracy=0.164807
   INFO:root:Epoch[0] Batch [40]	Speed: 20.34 samples/sec	accuracy=0.246875
   INFO:root:Epoch[0] Batch [60]	Speed: 21.66 samples/sec	accuracy=0.308984
   INFO:root:Epoch[0] Batch [80]	Speed: 22.15 samples/sec	accuracy=0.328125
   INFO:root:Epoch[0] Batch [100]	Speed: 22.38 samples/sec	accuracy=0.362500
   INFO:root:Epoch[0] Batch [120]	Speed: 21.52 samples/sec	accuracy=0.378125
   INFO:root:Epoch[0] Batch [140]	Speed: 22.96 samples/sec	accuracy=0.417969
   INFO:root:Epoch[0] Batch [160]	Speed: 20.04 samples/sec	accuracy=0.426563
   INFO:root:Epoch[0] Batch [180]	Speed: 18.58 samples/sec	accuracy=0.430078
   INFO:root:Epoch[0] Batch [200]	Speed: 21.39 samples/sec	accuracy=0.443750
   INFO:root:Epoch[0] Batch [220]	Speed: 17.69 samples/sec	accuracy=0.469531
   INFO:root:Epoch[0] Batch [240]	Speed: 18.59 samples/sec	accuracy=0.469141
   INFO:root:Epoch[0] Batch [260]	Speed: 21.68 samples/sec	accuracy=0.470313
   INFO:root:Epoch[0] Batch [280]	Speed: 21.05 samples/sec	accuracy=0.487891
   INFO:root:Epoch[0] Batch [300]	Speed: 22.30 samples/sec	accuracy=0.503125
   INFO:root:Epoch[0] Batch [320]	Speed: 20.98 samples/sec	accuracy=0.534766
   INFO:root:Epoch[0] Batch [340]	Speed: 15.53 samples/sec	accuracy=0.526172
   INFO:root:Epoch[0] Batch [360]	Speed: 13.88 samples/sec	accuracy=0.528516
   Traceback (most recent call last):
     File "example/image-classification/train_cifar10.py", line 79, in <module>
       fit.fit(args, sym, data.get_rec_iter)
     File "/Users/vikumar/incubator-mxnet/example/image-classification/common/fit.py", line 333, in fit
       monitor=monitor)
     File "/Users/vikumar/incubator-mxnet/python/mxnet/module/base_module.py", line 563, in fit
       next_data_batch = next(data_iter)
     File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
       return self.next()
     File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 840, in next
       check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
     File "/Users/vikumar/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [09:58:31] src/io/recordio_split.cc:29: Check failed: (reinterpret_cast<size_t>(end) & 3UL) == 0U (1 vs. 0) 
   
   Stack trace returned 10 entries:
   [bt] (0) 0   libmxnet.so                         0x000000010ae2cf30 dmlc::StackTrace() + 272
   [bt] (1) 1   libmxnet.so                         0x000000010ae2ccdf dmlc::LogMessageFatal::~LogMessageFatal() + 47
   [bt] (2) 2   libmxnet.so                         0x000000010c68a01c dmlc::io::RecordIOSplitter::FindLastRecordBegin(char const*, char const*) + 444
   [bt] (3) 3   libmxnet.so                         0x000000010c68d10c dmlc::io::InputSplitBase::ReadChunk(void*, unsigned long*) + 284
   [bt] (4) 4   libmxnet.so                         0x000000010c68d230 dmlc::io::InputSplitBase::Chunk::Load(dmlc::io::InputSplitBase*, unsigned long) + 144
   [bt] (5) 5   libmxnet.so                         0x000000010c6a0b8f dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()::operator()() const + 895
   [bt] (6) 6   libmxnet.so                         0x000000010c6a071d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()> >(void*) + 45
   [bt] (7) 7   libsystem_pthread.dylib             0x00007fff903c293b _pthread_body + 180
   [bt] (8) 8   libsystem_pthread.dylib             0x00007fff903c2887 _pthread_body + 0
   [bt] (9) 9   libsystem_pthread.dylib             0x00007fff903c208d thread_start + 13
   ```
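
The failed check in `RecordIOSplitter::FindLastRecordBegin` requires the end of the chunk it is given to be 4-byte aligned. One possible (unconfirmed) cause is a `.rec` file that is truncated or otherwise not a whole number of 4-byte words. The sketch below walks the RecordIO framing of a byte buffer to test that hypothesis. It assumes the usual dmlc layout (a `0xced7230a` magic word, a length word whose low 29 bits hold the payload length, and a payload padded to 4 bytes) and little-endian words as on x86; `check_recordio` is a hypothetical helper name, not part of MXNet.

```python
import struct

# Magic word that starts every part in a dmlc RecordIO file.
RECORDIO_MAGIC = 0xCED7230A


def check_recordio(data: bytes) -> dict:
    """Walk the RecordIO framing in `data` and report the first problem found.

    Assumes little-endian 32-bit words (true on x86). "parts" counts framed
    parts; one logical record may be split across several parts.
    """
    report = {
        "size": len(data),
        "aligned": len(data) % 4 == 0,  # whole file should be a multiple of 4
        "parts": 0,
        "error": None,
    }
    pos = 0
    while pos < len(data):
        if pos + 8 > len(data):
            report["error"] = f"truncated header at offset {pos}"
            break
        magic, lrec = struct.unpack_from("<II", data, pos)
        if magic != RECORDIO_MAGIC:
            report["error"] = f"bad magic at offset {pos}"
            break
        length = lrec & ((1 << 29) - 1)  # low 29 bits: payload length
        padded = (length + 3) & ~3       # payload is padded to 4 bytes
        if pos + 8 + padded > len(data):
            report["error"] = f"truncated payload at offset {pos}"
            break
        pos += 8 + padded
        report["parts"] += 1
    return report
```

To try it, read the training `.rec` file that the example uses in binary mode and pass the bytes to `check_recordio`. This loads the whole file into memory, which should be acceptable for a CIFAR-10 sized file. If `aligned` is false or `error` is set, re-downloading or regenerating the file would be a reasonable next step.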
   
   
   
   ## Environment info
   Build command: make -j8 USE_DIST_KVSTORE=1
   OS: macOS (Darwin 16.7.0)
   
   Output of the diagnose script:
   ```
   88e9fe53272d:incubator-mxnet vikumar$ python /tmp/lk.py
   ----------Python Info----------
   Version      : 3.6.5
   Compiler     : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
   Build        : ('default', 'Apr 26 2018 08:42:37')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 10.0.1
   Directory    : /Users/vikumar/anaconda3/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   /Users/vikumar/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   objc[59758]: Class CaptureDelegate is implemented in both /usr/local/opt/opencv/lib/libopencv_videoio.3.4.dylib (0x10ada5938) and /Users/vikumar/anaconda3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x1a17065ce0). One of the two will be used. Which one is undefined.
   Version      : 1.3.0
   Directory    : /Users/vikumar/incubator-mxnet/python/mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Darwin-16.7.0-x86_64-i386-64bit
   system       : Darwin
   node         : 88e9fe53272d.ant.amazon.com
   release      : 16.7.0
   version      : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
   b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0115 sec, LOAD: 0.5897 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0183 sec, LOAD: 0.2578 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0227 sec, LOAD: 0.1714 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0186 sec, LOAD: 0.2626 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.3084 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0148 sec, LOAD: 0.0534 sec.
   ```
   
