Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/30 21:02:28 UTC

[GitHub] [incubator-mxnet] brian-mann-math opened a new issue #18209: BatchNorm backward(train_mode=False) incorrect behavior on context mx.gpu()

brian-mann-math opened a new issue #18209:
URL: https://github.com/apache/incubator-mxnet/issues/18209


   ## Description
   When training DeepDream-style CNN visualizations using VGG16_bn, I noticed that the results did not seem correct compared to PyTorch. Running an apples-to-apples comparison, I found that the problem does not occur for pretrained CNNs without BatchNorm.
   
   Furthermore, after experimenting with train and predict mode, I have narrowed the problem down to a bug in `nn.BatchNorm`'s `.backward(train_mode=False)` when the layer is initialized with ctx=mx.gpu().
   
   In the example below, you can see that BatchNorm works correctly on mx.cpu() but not on mx.gpu(). Comparing with PyTorch (on either GPU or CPU) gives results very close to the mx.cpu() run.
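   
   For reference, the PyTorch side of the comparison can be sketched like this (a minimal sketch, not the original script; it assumes a recent PyTorch, where a freshly constructed `BatchNorm1d` in `eval()` mode plays the role of `backward(train_mode=False)`):
   
   ```python
   import numpy as np
   import torch
   import torch.nn as nn
   
   bn = nn.BatchNorm1d(100)
   bn.eval()  # eval() uses the running statistics, like train_mode=False
   
   np.random.seed(42)  # same seed as the MXNet example below
   x = torch.tensor(np.random.randn(16, 100), dtype=torch.float32,
                    requires_grad=True)
   y = bn(x).norm()
   y.backward()
   print(x.grad)  # close to the mx.cpu() train_mode=False result below
   ```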
   
   ### Error Message
   
   No error is generated; the gradient is simply computed incorrectly on GPU.
   
   ## To Reproduce
   
   Running MXNet version 1.6.0. Installed via `pip install mxnet-cu101mkl`.
   
   ```python
   import mxnet as mx
   from mxnet import gluon, autograd, nd
   from mxnet.gluon import nn
   
   import numpy as np
   
   def test(ctx):
       print("Context:", ctx)
       bn = nn.BatchNorm()
       bn.initialize(ctx=ctx)
       # For apples-to-apples comparison with PyTorch
       # create a numpy array with a fixed random seed
       # and convert to ndarray or torch tensor as needed
       np.random.seed(42) 
       x = np.random.randn(16, 100)
       x = nd.array(x, ctx=ctx)
       
       x.attach_grad()
       with autograd.record():
           y = bn(x).norm()
       y.backward()
       print('Gradient with .backward(train_mode=True):')
       print(x.grad)
       
       x.attach_grad()
       with autograd.record():
           y = bn(x).norm()
       y.backward(train_mode=False)
       print('Gradient with .backward(train_mode=False):')
       print(x.grad)
   
   test(mx.gpu())
   print('\n')
   test(mx.cpu())
   ```
   This should output:
   
   ```
   Context: gpu(0)
   Gradient with .backward(train_mode=True):
   
   [[ 1.54331431e-07 -9.90738691e-09  2.17082999e-07 ...  3.70418292e-08
     -4.56034911e-07 -1.70796298e-07]
    [-5.31670935e-07 -6.50823111e-08 -2.83676826e-07 ...  1.12570149e-08
     -3.72994634e-07 -7.78542471e-07]
    [ 1.03611626e-07  1.26948521e-07  4.37038494e-07 ...  4.80843418e-08
      8.25267421e-07  4.09489786e-07]
    ...
    [-2.66233769e-07  2.10875697e-07 -1.98560244e-07 ... -2.59636550e-07
      1.43746172e-06 -4.55288415e-07]
    [-4.00260944e-07  1.21869405e-07  4.91964613e-07 ...  2.66210179e-07
      1.15469061e-06  3.84336460e-07]
    [ 2.54435122e-07 -9.07216915e-08 -5.26608744e-07 ...  4.71994440e-07
     -4.30146486e-07 -5.01510669e-07]]
   <NDArray 16x100 @gpu(0)>
   Gradient with .backward(train_mode=False):
   
   [[ 1.5412704e-07 -9.9587867e-09  2.1744940e-07 ...  3.6937106e-08
     -4.5603929e-07 -1.7041563e-07]
    [-5.3151490e-07 -6.5216888e-08 -2.8525079e-07 ...  1.1193077e-08
     -3.7290255e-07 -7.7769141e-07]
    [ 1.0385542e-07  1.2743682e-07  4.3893598e-07 ...  4.8083859e-08
      8.2525179e-07  4.0822903e-07]
    ...
    [-2.6548534e-07  2.1083363e-07 -1.9969134e-07 ... -2.5873848e-07
      1.4370964e-06 -4.5389191e-07]
    [-3.9940915e-07  1.2190962e-07  4.9197621e-07 ...  2.6582453e-07
      1.1571157e-06  3.8428274e-07]
    [ 2.5489476e-07 -9.0713620e-08 -5.2661966e-07 ...  4.7046700e-07
     -4.3135498e-07 -5.0008794e-07]]
   <NDArray 16x100 @gpu(0)>
   
   
   Context: cpu(0)
   Gradient with .backward(train_mode=True):
   
   [[ 1.52795309e-07 -9.42009581e-09  2.20310852e-07 ...  3.75795253e-08
     -4.56473600e-07 -1.71663274e-07]
    [-5.25615860e-07 -6.48453096e-08 -2.87071714e-07 ...  1.17580106e-08
     -3.70792492e-07 -7.82021573e-07]
    [ 1.02882176e-07  1.28814335e-07  4.42847067e-07 ...  4.88764371e-08
      8.30220586e-07  4.10084482e-07]
    ...
    [-2.60770662e-07  2.13814275e-07 -2.01395267e-07 ... -2.60059522e-07
      1.44180660e-06 -4.55384509e-07]
    [-3.95230529e-07  1.24432887e-07  4.94030417e-07 ...  2.67437116e-07
      1.16999058e-06  3.86242363e-07]
    [ 2.52621589e-07 -9.11339484e-08 -5.34086894e-07 ...  4.72164828e-07
     -4.29882903e-07 -5.05452988e-07]]
   <NDArray 16x100 @cpu(0)>
   Gradient with .backward(train_mode=False):
   
   [[ 0.01185271 -0.0011071   0.01301745 ...  0.0038546  -0.01176191
     -0.00832031]
    [-0.04086535 -0.00770687 -0.01701735 ...  0.00120027 -0.00958996
     -0.03797242]
    [ 0.00802236  0.01523095  0.02622019 ...  0.00499825  0.02128031
      0.01989006]
    ...
    [-0.0202621   0.02531023 -0.01193004 ... -0.02662995  0.03713372
     -0.02210556]
    [-0.03070656  0.01466694  0.02933322 ...  0.02728209  0.0299198
      0.01867896]
    [ 0.019618   -0.01075783 -0.03143681 ...  0.04828149 -0.01112048
     -0.02442675]]
   <NDArray 16x100 @cpu(0)>
   ```
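   
   For what it's worth, the mx.cpu() numbers are the ones consistent with a hand computation. For a freshly initialized BatchNorm in predict mode, the running mean is 0, the running variance is 1, gamma is 1, and beta is 0, so (with MXNet's default epsilon of 1e-5) the layer reduces to y = x / sqrt(1 + eps), and the gradient of ||y|| has entries of order 1e-2 for this input. A numpy-only sketch of that check (the stats and parameter values are assumptions about a fresh layer, not taken from the run above):
   
   ```python
   import numpy as np
   
   np.random.seed(42)
   x = np.random.randn(16, 100)
   eps = 1e-5  # MXNet's default BatchNorm epsilon
   
   # Predict mode with fresh running stats: y = (x - 0) / sqrt(1 + eps)
   scale = 1.0 / np.sqrt(1.0 + eps)
   y = scale * x
   
   # d||y||/dx = scale * y / ||y||: entries of order 1e-2, matching the
   # mx.cpu() train_mode=False output; the gpu(0) result (order 1e-7)
   # instead matches train_mode=True, which is the bug.
   grad = scale * y / np.linalg.norm(y)
   print(grad[0, 0])
   ```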
   
   ## Environment
   
   We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
   ```
   curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
   
   ----------Python Info----------
   Version      : 3.6.5
   Compiler     : GCC 7.2.0
   Build        : ('default', 'Apr 29 2018 16:14:56')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 10.0.1
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.6.0
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
   Num GPUs     : 1
   Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
   ----------System Info----------
   Platform     : Linux-4.14.171-105.231.amzn1.x86_64-x86_64-with-glibc2.9
   system       : Linux
   node         : ip-172-16-165-164
   release      : 4.14.171-105.231.amzn1.x86_64
   version      : #1 SMP Thu Feb 27 23:49:15 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                4
   On-line CPU(s) list:   0-3
   Thread(s) per core:    2
   Core(s) per socket:    2
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
   Stepping:              1
   CPU MHz:               1812.110
   CPU max MHz:           3000.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              4600.14
   Hypervisor vendor:     Xen
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              46080K
   NUMA node0 CPU(s):     0-3
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.5285 sec.
   Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0005 sec, LOAD: 0.5373 sec.
   Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1000 sec, LOAD: 0.1116 sec.
   Timing for D2L: http://d2l.ai, DNS: 0.0091 sec, LOAD: 0.3228 sec.
   Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0350 sec, LOAD: 0.1672 sec.
   Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0486 sec, LOAD: 0.3785 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0022 sec, LOAD: 0.1064 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.002210855484008789 sec.
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org