Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/22 04:52:20 UTC

[GitHub] Northrend opened a new issue #10193: training job crashed with "RecordIOWriter error" on V100 multi-gpu node

URL: https://github.com/apache/incubator-mxnet/issues/10193
 
 
   ## Description
   Hi, my simple training job crashed and MXNet threw a **RecordIOWriter** error.
   Basically, I ran a training job for an image classification task with two RecordIO files, one for training and one for validation. **The .rec file is quite large: more than 800 GB, containing about 12 million samples.**
   When I ran this job on an 8-GPU NVIDIA P100-PCIE node with CUDA 8.0 and cuDNN 5.1, it finished successfully. **It now crashes on an 8-GPU NVIDIA V100 node with CUDA 9.1 and cuDNN 7, while the code and the .rec file are exactly the same.**
   
   Then I ran a script that simply traverses the .rec file on the V100 node, and the same kMagic error occurred.
   **Here is the error log:**
   ```
   ...
   443632th batch: (1L, 3L, 224L, 224L)
   [12:26:01] include/dmlc/././logging.h:308: [12:26:01] src/recordio.cc:117: Check failed: p[0] == RecordIOWriter::kMagic
   
   Stack trace returned 5 entries:
   [bt] (0) /opt/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc19RecordIOChunkReader10NextRecordEPNS_10InputSplit4BlobE+0x306) [0x7f10d1664116]
   [bt] (1) /opt/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2d98db8) [0x7f10d0f3edb8]
   [bt] (2) /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0xf43e) [0x7f10df6fe43e]
   [bt] (3) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f10e9b8d6ba]
   [bt] (4) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f10e98c341d]
   
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [12:26:01] src/recordio.cc:117: Check failed: p[0] == RecordIOWriter::kMagic
   
   Stack trace returned 5 entries:
   [bt] (0) /opt/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc19RecordIOChunkReader10NextRecordEPNS_10InputSplit4BlobE+0x306) [0x7f10d1664116]
   [bt] (1) /opt/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2d98db8) [0x7f10d0f3edb8]
   [bt] (2) /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0xf43e) [0x7f10df6fe43e]
   [bt] (3) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f10e9b8d6ba]
   [bt] (4) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f10e98c341d]
   ```
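   The failing check (`p[0] == RecordIOWriter::kMagic` in `src/recordio.cc`) means the reader landed on bytes that do not start with the RecordIO magic word. As a rough illustration, here is a minimal sketch of that framing and check in pure Python, assuming the dmlc-core layout (a `0xced7230a` magic word, then a length word whose lower 29 bits hold the payload size, then the payload padded to a 4-byte boundary); this is not MXNet's actual reader, just a model of what it verifies:

   ```python
   import struct

   # Assumed value of dmlc RecordIOWriter::kMagic (from dmlc-core headers).
   KMAGIC = 0xCED7230A

   def write_record(buf, payload):
       """Append one record: magic, length word, payload, 4-byte padding."""
       lrec = len(payload) & ((1 << 29) - 1)  # lower 29 bits = length
       buf += struct.pack("<II", KMAGIC, lrec)
       buf += payload
       buf += b"\x00" * ((-len(payload)) % 4)

   def read_records(data):
       """Walk the buffer, raising (like MXNet does) on a bad magic word."""
       pos, out = 0, []
       while pos < len(data):
           magic, lrec = struct.unpack_from("<II", data, pos)
           if magic != KMAGIC:
               raise ValueError("Check failed: p[0] == RecordIOWriter::kMagic"
                                " at byte offset %d" % pos)
           length = lrec & ((1 << 29) - 1)
           pos += 8
           out.append(data[pos:pos + length])
           pos += length + ((-length) % 4)  # skip payload and padding
       return out

   if __name__ == "__main__":
       buf = bytearray()
       for payload in (b"hello", b"recordio!"):
           write_record(buf, payload)
       print(read_records(bytes(buf)))  # both records round-trip
       buf[0] ^= 0xFF                   # flip one header byte -> bad magic
       read_records(bytes(buf))         # raises, mirroring the crash above
   ```

   A single corrupted or misaligned byte in the header of any record is enough to trip this check, which is why the error points at file corruption or a reader/offset bug rather than anything GPU-specific.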
   ## Environment info (Required)
   I am using:
   - MXNet 1.0.0
   - Python 2.7
   - OpenCV 3.4.0
   - CUDA 9.1.85
   - cuDNN 7.05
   - NVIDIA driver 390.30
   - Ubuntu 16.04
   
   ## Question
   Could this error be caused by the CUDA 9.1 environment, or by the RecordIO training file being so large?
   Please ask for more information if necessary.
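   One way to narrow this down without involving MXNet at all is to scan the .rec file on the V100 node and report where the first bad record header appears (e.g. to rule out a corrupted copy of the file). This is a hedged sketch under the same assumed dmlc RecordIO layout as above (magic `0xced7230a`, 29-bit length word, 4-byte padding); the file path is a placeholder:

   ```python
   import struct
   import sys

   KMAGIC = 0xCED7230A  # assumed dmlc RecordIOWriter::kMagic

   def scan_recordio(path):
       """Return (num_good_records, bad_offset); bad_offset is None if clean."""
       n = 0
       with open(path, "rb") as f:
           while True:
               header = f.read(8)
               if len(header) < 8:
                   return n, None  # clean end of file
               magic, lrec = struct.unpack("<II", header)
               if magic != KMAGIC:
                   return n, f.tell() - 8  # offset of the bad magic word
               length = lrec & ((1 << 29) - 1)
               f.seek(length + ((-length) % 4), 1)  # skip payload + padding
               n += 1

   if __name__ == "__main__":
       count, bad = scan_recordio(sys.argv[1])  # e.g. train.rec (placeholder)
       if bad is None:
           print("OK: %d records" % count)
       else:
           print("bad record header at byte %d (after %d good records)"
                 % (bad, count))
   ```

   Because it only seeks past payloads instead of loading them, this runs quickly even on an 800 GB file; if it reports a bad offset, the file copy itself is suspect, and if it comes back clean, the problem is more likely in the reader on the new environment.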
