Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/16 06:47:00 UTC
[GitHub] FCInter opened a new issue #13902: Loss becomes NaN when setting use_global_stats=True for batchnorm
URL: https://github.com/apache/incubator-mxnet/issues/13902
## Description
I trained a model and used it to perform prediction. While building the predictor, if I set the argument `for_training=False`, the prediction results are bad, as bad as those produced by a randomly initialized model.
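For context, the predictor is bound roughly as follows (a minimal sketch, not my exact code; the checkpoint prefix, epoch, and input shape are placeholders):
```
import mxnet as mx

# Placeholder checkpoint prefix/epoch and input shape.
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=None)
# With for_training=False, BatchNorm ignores per-batch statistics and
# normalizes with the stored moving_mean/moving_var, so stale or missing
# moving statistics yield near-random predictions.
mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
mod.set_params(arg_params, aux_params)
mod.forward(mx.io.DataBatch([mx.nd.zeros((1, 3, 224, 224))]), is_train=False)
```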
## Environment info (Required)
```
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '18.1')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.3.0')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/mxnet')
('Commit Hash :', 'b3be92f4a48bce62a5a8424271871c2f81c8f7f1')
----------System Info----------
('Platform :', 'Linux-4.4.0-87-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'B22-C09-G5500-01-GPU')
('release :', '4.4.0-87-generic')
('version :', '#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
Stepping: 1
CPU MHz: 2400.093
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4801.21
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-21,44-65
NUMA node1 CPU(s): 22-43,66-87
```
Package used (Python/R/Scala/Julia):
Python
## Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Build config:
I installed via pip (not built from source).
## Error Message:
The training log:
```
Epoch[0] Batch [178] Speed: 16.56 samples/sec Train-RPNAcc=0.870976, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990496, RCNNLogLoss=nan, RCNNL1Loss=nan,
Epoch[0] Batch [179] Speed: 14.71 samples/sec Train-RPNAcc=0.871275, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990522, RCNNLogLoss=nan, RCNNL1Loss=nan,
```
## Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
This is how I build the ResNet-50 model.
```
import mxnet as mx

# Note: these are methods of a network-builder class that provides
# self.eps, self.use_global_stats, self.num_layers, and self.workspace.

def residual_unit(self, data, num_filter, stride, dim_match, name, bottle_neck=True,
                  bn_mom=0.9, workspace=256, memonger=False):
    """Return a ResNet unit symbol for building ResNet.

    Parameters
    ----------
    data : str
        Input data
    num_filter : int
        Number of output channels
    stride : tuple
        Stride used in convolution
    dim_match : bool
        True means the input and output channel numbers match, otherwise they differ
    name : str
        Base name of the operators
    workspace : int
        Workspace used in the convolution operator
    """
    if bottle_neck:
        # Same as https://github.com/facebook/fb.resnet.torch#notes, slightly different from the original paper
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter * 0.25), kernel=(1, 1),
                                   stride=(1, 1), pad=(0, 0), no_bias=True,
                                   workspace=workspace, name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter * 0.25), kernel=(3, 3),
                                   stride=stride, pad=(1, 1), no_bias=True,
                                   workspace=workspace, name=name + '_conv2')
        bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn3', use_global_stats=self.use_global_stats)
        act3 = mx.sym.Activation(data=bn3, act_type='relu', name=name + '_relu3')
        conv3 = mx.sym.Convolution(data=act3, num_filter=num_filter, kernel=(1, 1), stride=(1, 1),
                                   pad=(0, 0), no_bias=True, workspace=workspace,
                                   name=name + '_conv3')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1, 1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv3 + shortcut
    else:
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3, 3), stride=stride,
                                   pad=(1, 1), no_bias=True, workspace=workspace,
                                   name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(3, 3), stride=(1, 1),
                                   pad=(1, 1), no_bias=True, workspace=workspace,
                                   name=name + '_conv2')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1, 1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv2 + shortcut
def resnet(self, data, units, num_stages, filter_list, num_classes, bottle_neck=True,
           bn_mom=0.9, workspace=256, dtype='float32', memonger=False):
    """Return a ResNet symbol.

    Parameters
    ----------
    units : list
        Number of units in each stage
    num_stages : int
        Number of stages
    filter_list : list
        Channel size of each stage
    num_classes : int
        Output size of the symbol
    workspace : int
        Workspace used in the convolution operator
    dtype : str
        Precision (float32 or float16)
    """
    num_unit = len(units)
    assert num_unit == num_stages
    body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2, 2),
                              pad=(3, 3), no_bias=True, name="conv0", workspace=workspace)
    body = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                            name='bn0', use_global_stats=self.use_global_stats)
    body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
    body = mx.sym.Pooling(data=body, kernel=(3, 3), stride=(2, 2), pad=(1, 1), pool_type='max')
    for i in range(num_stages):
        stride = (2, 2)
        if i == num_stages - 1 or i == 0:
            stride = (1, 1)
        body = self.residual_unit(body, filter_list[i + 1], stride, False,
                                  name='stage%d_unit%d' % (i + 1, 1), bottle_neck=bottle_neck,
                                  workspace=workspace, memonger=memonger)
        for j in range(units[i] - 1):
            body = self.residual_unit(body, filter_list[i + 1], (1, 1), True,
                                      name='stage%d_unit%d' % (i + 1, j + 2),
                                      bottle_neck=bottle_neck, workspace=workspace,
                                      memonger=memonger)
    feat_conv_3x3 = mx.sym.Convolution(data=body, kernel=(3, 3), pad=(6, 6), dilate=(6, 6),
                                       num_filter=1024, name="feat_conv_3x3")
    feat_conv_3x3_relu = mx.sym.Activation(data=feat_conv_3x3, act_type="relu",
                                           name="feat_conv_3x3_relu")  # shape: (1, 1024, 38, 50)
    return feat_conv_3x3_relu
def get_resnet_symbol(self, data, num_classes=2, dtype='float32'):
    """
    Adapted from https://github.com/tornadomeet/ResNet/blob/master/train_resnet.py
    Original author Wei Wu
    """
    num_layers = self.num_layers
    if num_layers >= 50:
        filter_list = [64, 256, 512, 1024, 2048]
        bottle_neck = True
    else:
        filter_list = [64, 64, 128, 256, 512]
        bottle_neck = False
    num_stages = 4
    if num_layers == 18:
        units = [2, 2, 2, 2]
    elif num_layers == 34:
        units = [3, 4, 6, 3]
    elif num_layers == 50:
        units = [3, 4, 6, 3]
    elif num_layers == 101:
        units = [3, 4, 23, 3]
    elif num_layers == 152:
        units = [3, 8, 36, 3]
    elif num_layers == 200:
        units = [3, 24, 36, 3]
    elif num_layers == 269:
        units = [3, 30, 48, 8]
    else:
        raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
    return self.resnet(data=data,
                       units=units,
                       num_stages=num_stages,
                       filter_list=filter_list,
                       num_classes=num_classes,
                       bottle_neck=bottle_neck,
                       workspace=self.workspace,
                       dtype=dtype)
```
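As a side note on what `use_global_stats=True` actually consumes: every `BatchNorm` in the symbol above contributes a pair of auxiliary states. A standalone check, independent of the class above (the `eps` value here is illustrative):
```
import mxnet as mx

# A single BatchNorm symbol, mirroring the layers used above.
data = mx.sym.Variable('data')
bn = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=2e-5, momentum=0.9,
                      name='bn0', use_global_stats=True)
print(bn.list_arguments())         # ['data', 'bn0_gamma', 'bn0_beta']
print(bn.list_auxiliary_states())  # ['bn0_moving_mean', 'bn0_moving_var']
```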
## Steps to reproduce
(Paste the commands you ran that produced the error.)
1. For all the `BatchNorm` layers, if `self.use_global_stats` is `False`, everything works fine: the training loss keeps going down and the training accuracy increases. However, if `self.use_global_stats` is `True`, the training loss becomes `NaN`, as shown in the error message above.
2. I loaded a pretrained ResNet-50 checkpoint, downloaded from [here](http://data.dmlc.ml/mxnet/models/imagenet/resnet/50-layers/); a loading sketch follows this list.
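For reference, this is roughly how such a checkpoint can be loaded so that the BatchNorm moving statistics are actually populated (a minimal sketch; the prefix `resnet-50`, epoch `0`, and shapes are assumptions based on the files at that URL, not my exact code):
```
import mxnet as mx

# Prefix/epoch assumed from the downloaded files
# (resnet-50-symbol.json, resnet-50-0000.params).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)

# aux_params carries every *_moving_mean / *_moving_var. With
# use_global_stats=True, BatchNorm reads these even during training, so
# they must be passed along with arg_params; any aux state missing from
# the checkpoint keeps its default initialization.
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=['softmax_label'])
mod.bind(data_shapes=[('data', (1, 3, 224, 224))],
         label_shapes=[('softmax_label', (1,))], for_training=True)
mod.set_params(arg_params, aux_params, allow_missing=True)
```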
What's wrong with my code?
Thank you all for helping me!!!