
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/09/26 10:36:40 UTC

[GitHub] al-rigazzi commented on issue #10865: A potential race condition in the executor or engine.

URL: https://github.com/apache/incubator-mxnet/issues/10865#issuecomment-424668146
 
 
   @zheng-da When using very large batches, I sometimes observe NaN values in the first training iteration. This happens much more often when I use more OMP threads and the network is large. For example, with more than 20 OMP threads, VGG-16, and 1024 samples per batch (on a single node), I get NaNs 10% of the time.
   
   I think this could be due to a race condition when allocating or copying MKLDNN memory. Does that make sense? Do you know which functions I should monitor to find the root cause?
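
One way to narrow this down could be to find the first layer whose output contains a NaN in a failing run. Below is a minimal sketch, assuming the per-layer outputs have already been collected in execution order (for example through a monitor callback). The helper `first_nan_layer` and the layer names are hypothetical and not part of MXNet.

```python
import math
from typing import Iterable, Mapping, Optional


def first_nan_layer(outputs: Mapping[str, Iterable[float]]) -> Optional[str]:
    """Return the name of the first layer whose output contains a NaN.

    `outputs` is assumed to map layer names to flattened output values,
    inserted in execution order (dicts keep insertion order in Python 3.7+).
    Returns None if no layer produced a NaN.
    """
    for name, values in outputs.items():
        if any(math.isnan(v) for v in values):
            return name
    return None


# Hypothetical per-layer outputs from one forward pass (illustrative values only).
example = {
    "conv1_output": [0.1, 0.5, 0.2],
    "relu1_output": [0.1, float("nan"), 0.2],
    "fc1_output": [float("nan"), float("nan")],
}
print(first_nan_layer(example))
```

Repeating such a check across runs with different `OMP_NUM_THREADS` values might show whether the first failing layer is always the same, which would help decide where to look for a race.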
   
   Thanks,
   Al
