Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/11/09 06:39:18 UTC
[GitHub] [incubator-mxnet] wy3406 opened a new issue #19498: SyncBN causes the memory to gradually increase with iteration
wy3406 opened a new issue #19498:
URL: https://github.com/apache/incubator-mxnet/issues/19498
## Description
(A clear and concise description of what the bug is.)
- I have a few issues/questions regarding SyncBN
When I train my custom image segmentation model with BN, GPU memory usage is stable. But when I replace BN with SyncBN (the swap is sketched below), GPU memory grows gradually with each iteration until it fills the entire GPU, at which point training stalls. I also tried a smaller batch size than with BN, but memory still eventually fills all GPUs.
Note that no warning or error is printed when I use SyncBN.
Is there something I have missed?
- Environment: Python 3.6.9; TITAN RTX × 8; CUDA 10.1
- Framework: mxnet-cu101-1.7.0 and gluoncv-0.8.0
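For reference, the swap is nothing more than changing the normalization layer passed to the model. A minimal sketch, assuming a gluon model that accepts `norm_layer`/`norm_kwargs` arguments (as mine does; the full repro script is later in this thread):
```
import mxnet as mx
from mxnet.gluon import nn

ngpus = 8

# Baseline run: plain BatchNorm, per-GPU statistics -- memory stays flat
model = Net(norm_layer=nn.BatchNorm, norm_kwargs=None)

# SyncBN run: statistics synchronized across all GPUs -- memory keeps growing
model = Net(norm_layer=mx.gluon.contrib.nn.SyncBatchNorm,
            norm_kwargs={'num_devices': ngpus})
```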
### Error Message
(Paste the complete error message. Please also include stack trace by setting environment variable `DMLC_LOG_STACK_TRACE_DEPTH=100` before running your script.)
## To Reproduce
(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
### Steps to reproduce
(Paste the commands you ran that produced the error.)
1.
2.
## What have you tried to solve it?
1.
2.
## Environment
***We recommend using our script for collecting the diagnostic information with the following command***
`curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`
<details>
<summary>Environment Information</summary>
```
# Paste the diagnose.py command output here
```
</details>
[GitHub] [incubator-mxnet] kohillyang commented on issue #19498: SyncBN causes the memory to gradually increase with iteration
kohillyang commented on issue #19498:
URL: https://github.com/apache/incubator-mxnet/issues/19498#issuecomment-732689325
Could it be caused by DataParallelModel? Multi-threaded and multi-process training are not supported by MXNet. To speed up training, I suggest trying Horovod instead.
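A rough sketch of what a Horovod-based setup could look like (untested; it assumes `horovod` with MXNet support is installed and the script is launched with something like `horovodrun -np 4 python train_hvd.py`, and reuses `Net` and `TestData` from the repro below; note that with one process per GPU, plain BatchNorm statistics stay per-worker, so this sidesteps SyncBatchNorm rather than reproducing it):
```
import horovod.mxnet as hvd
import mxnet as mx
from mxnet import autograd, gluon

hvd.init()
ctx = mx.gpu(hvd.local_rank())  # one GPU per worker process

model = Net(norm_layer=gluon.nn.BatchNorm, norm_kwargs=None)
model.initialize(mx.init.MSRAPrelu(), ctx=ctx)

params = model.collect_params()
hvd.broadcast_parameters(params, root_rank=0)  # identical initial weights everywhere

# DistributedTrainer averages gradients across workers via allreduce
trainer = hvd.DistributedTrainer(params, 'adam', {'learning_rate': 0.001})

loss_fn = gluon.loss.L1Loss()
train_data = gluon.data.DataLoader(TestData(), 20, shuffle=True, num_workers=4)

for ipt, targ in train_data:
    ipt, targ = ipt.as_in_context(ctx), targ.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(model(ipt), targ)
    loss.backward()
    trainer.step(ipt.shape[0])
```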
[GitHub] [incubator-mxnet] wy3406 commented on issue #19498: SyncBN causes the memory to gradually increase with iteration
wy3406 commented on issue #19498:
URL: https://github.com/apache/incubator-mxnet/issues/19498#issuecomment-724387973
@leezu In the following example, `nvidia-smi` shows that GPU memory grows slowly as the iterations progress:
```
from tqdm import tqdm
import numpy as np
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import gluon, autograd
from mxnet.gluon import nn
from mxnet.gluon.data import dataset
from gluoncv.utils.parallel import DataParallelCriterion, DataParallelModel


class Activation(nn.HybridBlock):
    """Activation function used in MobileNetV3."""
    def __init__(self, act_func, **kwargs):
        super(Activation, self).__init__(**kwargs)
        if act_func == "relu":
            self.act = nn.Activation('relu')
        elif act_func == "relu6":
            self.act = ReLU6()        # defined elsewhere in my codebase; unused in this repro
        elif act_func == "hard_sigmoid":
            self.act = HardSigmoid()  # defined elsewhere in my codebase; unused in this repro
        elif act_func == "swish":
            self.act = nn.Swish()
        elif act_func == "leaky":
            self.act = nn.LeakyReLU(alpha=0.375)
        else:
            raise NotImplementedError

    def hybrid_forward(self, F, x):
        return self.act(x)


def ConvBlock(in_channels, out_channels,
              kernel_size=1, strides=1, padding=0, num_groups=1,
              use_act=True, act_type='relu',
              use_bias=False, conv2d=nn.Conv2D,
              norm_layer=nn.BatchNorm, norm_kwargs=None):
    """Convolution -> norm_layer -> optional activation."""
    out = nn.HybridSequential()
    with out.name_scope():
        out.add(conv2d(in_channels=in_channels, channels=out_channels,
                       kernel_size=kernel_size, strides=strides,
                       padding=padding, use_bias=use_bias, groups=num_groups),
                norm_layer(in_channels=out_channels,
                           **({} if norm_kwargs is None else norm_kwargs)))
        if use_act:
            out.add(Activation(act_type))
    return out


class Net(nn.HybridBlock):
    """Small conv net; every block contains one norm_layer instance."""
    def __init__(self, norm_layer, norm_kwargs):
        super(Net, self).__init__(prefix='')
        self.features = nn.HybridSequential()
        # (in_channels, out_channels, strides) for the six conv blocks
        for in_c, out_c, s in [(3, 256, 1), (256, 512, 2), (512, 512, 2),
                               (512, 512, 2), (512, 1024, 2), (1024, 1024, 2)]:
            self.features.add(ConvBlock(in_c, out_c,
                                        kernel_size=3, strides=s, padding=1,
                                        norm_layer=norm_layer,
                                        norm_kwargs=norm_kwargs))
        self.features.add(nn.GlobalAvgPool2D())
        self.features.add(nn.Flatten())
        self.fc = nn.Dense(1, in_units=1024, use_bias=False)

    def hybrid_forward(self, F, x):
        x = self.features(x)
        return self.fc(x)


class TestData(dataset.Dataset):
    """Synthetic dataset: random 3x512x512 inputs, scalar targets."""
    def __init__(self):
        self.number = int(1e5)  # __len__ must return an int, not a float

    def __len__(self):
        return self.number

    def __getitem__(self, idx):
        X = np.random.randn(3, 512, 512)
        Y = np.random.randn(1)
        return nd.array(X, dtype=np.float32), nd.array(Y, dtype=np.float32)


ngpus = 4
_ctx = [mx.gpu(i) for i in range(ngpus)]
_batch_size = 20
norm_kwargs = {'num_devices': ngpus}  # SyncBatchNorm syncs stats across these GPUs
usesyncbn = True

model = Net(norm_layer=mx.gluon.contrib.nn.SyncBatchNorm, norm_kwargs=norm_kwargs)
model.initialize(mx.init.MSRAPrelu(), ctx=_ctx)
net = DataParallelModel(model, _ctx, usesyncbn)
criterion = DataParallelCriterion(mx.gluon.loss.L1Loss(), _ctx, usesyncbn)
optimizer = mx.gluon.Trainer(net.module.collect_params(), 'adam',
                             {'learning_rate': 0.001}, kvstore=mx.kvstore.create())

train_data = gluon.data.DataLoader(TestData(), _batch_size,
                                   shuffle=True, last_batch='rollover',
                                   num_workers=4, pin_memory=False)

for j in range(1000):
    tbar = tqdm(train_data)
    for i, (ipt, targ) in enumerate(tbar):
        with autograd.record(True):
            oupt = net(ipt)
            losses = criterion(oupt, targ)
            mx.nd.waitall()
            autograd.backward(losses)
        optimizer.step(_batch_size)
    mx.nd.waitall()
```
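To quantify the growth without watching `nvidia-smi` by hand, one option (a sketch; `mx.context.gpu_memory_info` exists in MXNet 1.5+ and reports driver-level free/total bytes per device, i.e. the same numbers `nvidia-smi` shows) is to log memory every N batches inside the loop above:
```
import mxnet as mx

def log_gpu_memory(ngpus, step):
    """Print per-GPU memory in use, as the CUDA driver sees it."""
    for d in range(ngpus):
        free, total = mx.context.gpu_memory_info(d)
        print('step %d gpu%d: %.0f MiB used' % (step, d, (total - free) / 1024**2))

# e.g. inside the batch loop:
#     if i % 100 == 0:
#         log_gpu_memory(ngpus, i)
```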
[GitHub] [incubator-mxnet] leezu commented on issue #19498: SyncBN causes the memory to gradually increase with iteration
leezu commented on issue #19498:
URL: https://github.com/apache/incubator-mxnet/issues/19498#issuecomment-724130801
Please provide a short example that reproduces the bug. That will make it easier to identify and fix the memory leak.