Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/27 16:50:22 UTC

[GitHub] ajayvohra2005 opened a new issue #10272: MxNet dmlc error that goes away on retry

URL: https://github.com/apache/incubator-mxnet/issues/10272
 
 
   MXNet 1.0, SSD branch, compiled from source as follows:
   
   make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1 USE_S3=1
   
   A run sometimes fails with the dmlc error shown in the stack trace below; the error goes away on an immediate retry.
   
   <module 'mxnet' from '/home/ubuntu/mxnet/python/mxnet/__init__.pyc'>
   Warning: using pre-installed version of mxnet may cause unexpected error...
   (export MXNET_EXAMPLE_SSD_DISABLE_PRE_INSTALLED=1) to prevent loading pre-installed mxnet.
   [16:05:43] src/io/s3_filesys.cc:903: No AWS Region set, using default region us-east-1
   [16:05:45] src/io/iter_image_det_recordio.cc:110: Loaded ImageList from s3://argo-ai-aws-ml/training_data/train.lst 11936 Image records
   [16:05:45] src/io/iter_image_det_recordio.cc:281: ImageDetRecordIOParser: s3://argo-ai-aws-ml/training_data/train.rec, use 31 threads for decoding..
   [16:05:46] src/io/iter_image_det_recordio.cc:334: ImageDetRecordIOParser: s3://argo-ai-aws-ml/training_data/train.rec, label padding width: 400
   WARNING:root:data_shape 896 was not tested, use with caution.
   INFO:root:Start training with (gpu(0),gpu(1),gpu(2),gpu(3)) from pretrained model /efs/argonet/output/adam/exp1/ssd
   WARNING:root:data_shape 1664 was not tested, use with caution.
   [16:05:57] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [16:06:01] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [16:06:05] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [16:06:12] /home/ubuntu/mxnet/dmlc-core/include/dmlc/logging.h:308: [16:06:12] src/kvstore/././kvstore_dist_server.h:378: Check failed: !stored.is_none() init 0 first
   
   Stack trace returned 8 entries:
   [bt] (0) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f7f9153f80c]
   [bt] (1) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore17KVStoreDistServer22DefaultStorageResponseEiRKNS_7NDArrayERKN2ps6KVMetaERKNS5_7KVPairsIfEEPNS5_8KVServerIfEE+0x5ee) [0x7f7f9405ffae]
   [bt] (2) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore17KVStoreDistServer17DataHandleDefaultERKN2ps6KVMetaERKNS2_7KVPairsIfEEPNS2_8KVServerIfEE+0x2db) [0x7f7f94060bab]
   [bt] (3) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8KVServerIfE7ProcessERKNS_7MessageE+0x10b) [0x7f7f9404ab1b]
   [bt] (4) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8Customer9ReceivingEv+0x55c) [0x7f7f940bdedc]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7f9f1a2c80]
   [bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f7fa40046ba]
   [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7fa3d3a41d]
   
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [16:06:12] src/kvstore/././kvstore_dist_server.h:378: Check failed: !stored.is_none() init 0 first
   
   Stack trace returned 8 entries:
   [bt] (0) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f7f9153f80c]
   [bt] (1) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore17KVStoreDistServer22DefaultStorageResponseEiRKNS_7NDArrayERKN2ps6KVMetaERKNS5_7KVPairsIfEEPNS5_8KVServerIfEE+0x5ee) [0x7f7f9405ffae]
   [bt] (2) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore17KVStoreDistServer17DataHandleDefaultERKN2ps6KVMetaERKNS2_7KVPairsIfEEPNS2_8KVServerIfEE+0x2db) [0x7f7f94060bab]
   [bt] (3) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8KVServerIfE7ProcessERKNS_7MessageE+0x10b) [0x7f7f9404ab1b]
   [bt] (4) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8Customer9ReceivingEv+0x55c) [0x7f7f940bdedc]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7f9f1a2c80]
   [bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f7fa40046ba]
   [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7fa3d3a41d]
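
Since the failure is reported to go away on an immediate retry, one possible stopgap while the root cause is investigated is to wrap the launch in a small retry loop. The following is only a minimal sketch in Python, not part of MXNet: the helper name `run_with_retry`, its parameters, and the example command are hypothetical, and it assumes the launching command exits with a non-zero status when the check fails, which may not hold if only a server process aborts while the workers keep waiting.

```python
import subprocess
import sys
import time


def run_with_retry(cmd, max_attempts=3, delay_seconds=5.0):
    """Run cmd and rerun it if it exits with a non-zero status.

    Hypothetical helper, not an MXNet API. Returns the number of
    attempts used on success and raises RuntimeError if every
    attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            "attempt %d failed with exit code %d" % (attempt, result.returncode),
            file=sys.stderr,
        )
        # Pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError("command failed after %d attempts" % max_attempts)


# Example call (the script name is a placeholder for the real launch command):
# run_with_retry([sys.executable, "train.py"], max_attempts=2)
```

A retry loop only hides the symptom. The message `init 0 first` suggests that a request for key 0 was handled by the kvstore server before that key had been initialized, so the ordering of initialization across workers and servers is probably the thing to look at.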
