Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/01 06:06:36 UTC

[GitHub] jiajinyu opened a new issue #11962: Distributed and local train have different results when training a wide-deep network.

URL: https://github.com/apache/incubator-mxnet/issues/11962
 
 
   ## Description
   We see a small but noticeable difference in outputs when training a wide-deep network. The difference only appears when the network has both the deep and the wide parts. We use only one worker. The local command is `OMP_NUM_THREADS=4 MXNET_CPU_WORKER_NTHREADS=24 python train.py some_other_params` and the distributed command is `OMP_NUM_THREADS=4 MXNET_CPU_WORKER_NTHREADS=24 incubator-mxnet/tools/launch.py -n 1 --launcher=local python train.py some_other_params --kvstore dist_async`.
   
   To find out where the difference occurs, we log the values and states with code like this:
   ```
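   # `mod`, `batch`, `nbatch`, and `args` are assumed to be defined by the surrounding training loop.
   suffix = 'single' if args.kvstore is None else args.kvstore
   # Snapshot the parameters and the batch inputs before the forward pass.
   mod.save_params('module-train-checkpoint-{}-{}.params'.format(suffix, nbatch))
   mx.nd.save('batch_{}_{}_inputs.ndarray'.format(nbatch, suffix), batch.data)
   # Run the forward pass and snapshot the first output.
   mod.forward(batch)
   outputs = mod.get_outputs()[0]
   mx.nd.save('batch_{}_{}_outputs.ndarray'.format(nbatch, suffix), outputs)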
   ```
   **What we see is that both params and inputs are the same**, but the outputs from distributed and local training are different, which is really strange to me.
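
For reference, here is a minimal sketch of how the saved files could be compared offline, assuming NumPy is available. `compare_arrays` and `load_first_ndarray` are hypothetical helper names (not part of MXNet), and the tolerances are arbitrary choices. Only `mx.nd.load` and `NDArray.asnumpy` are actual MXNet calls.

```python
import numpy as np


def compare_arrays(local, dist, rtol=1e-5, atol=1e-7):
    """Summarize how two same-shaped arrays differ (hypothetical helper)."""
    local = np.asarray(local, dtype=np.float64)
    dist = np.asarray(dist, dtype=np.float64)
    if local.shape != dist.shape:
        raise ValueError('shape mismatch: {} vs {}'.format(local.shape, dist.shape))
    abs_diff = np.abs(local - dist)
    return {
        # True only if every element is exactly equal
        'bitwise_equal': bool(np.array_equal(local, dist)),
        # True if the arrays agree within the (assumed) tolerances
        'allclose': bool(np.allclose(local, dist, rtol=rtol, atol=atol)),
        'max_abs_diff': float(abs_diff.max()) if abs_diff.size else 0.0,
        'num_mismatched': int(np.count_nonzero(abs_diff)),
    }


def load_first_ndarray(path):
    """Load the first array from a file written by mx.nd.save (needs MXNet)."""
    import mxnet as mx  # imported lazily so compare_arrays works without MXNet
    loaded = mx.nd.load(path)
    # mx.nd.load returns a list or a dict, depending on what was saved
    first = next(iter(loaded.values())) if isinstance(loaded, dict) else loaded[0]
    return first.asnumpy()
```

With the file names from the logging code above, something like `compare_arrays(load_first_ndarray('batch_0_single_outputs.ndarray'), load_first_ndarray('batch_0_dist_async_outputs.ndarray'))` would report whether the outputs for batch 0 differ exactly or only within floating-point tolerance. The same check could be applied to each entry of the two `.params` files to confirm that the parameters really match.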
   
   May I ask how to debug and check from here? Thanks.
