Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/10/31 05:34:15 UTC
[GitHub] varunrajk opened a new issue #13054: Multiple trainers within a single worker using a distributed KVStore
URL: https://github.com/apache/incubator-mxnet/issues/13054
## Description
The Gluon Trainer `step` method uses the enumeration indices of its parameter list as keys when pushing gradients to and pulling parameters from the kvstore. Using two trainers within a single worker script (in a distributed training setting) therefore causes a problem: each trainer uses the same set of integer keys on the distributed KVStore.
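The collision can be sketched without MXNet. This is not MXNet's actual implementation, just a pure-Python illustration of the keying scheme described above: each trainer enumerates its own parameter list starting from zero, so two independent trainers map different parameters onto the same integer keys.

```python
# Simplified sketch (hypothetical helper, not MXNet code): a Trainer keys each
# parameter on the kvstore by its position in the trainer's own parameter list.
def trainer_keys(params):
    """Return the kvstore keys a trainer would use for these parameters."""
    return [i for i, _ in enumerate(params)]

# Two trainers, each wrapping a different model's parameters...
trainer_a_keys = trainer_keys(['dense0_weight'])
trainer_b_keys = trainer_keys(['dense1_weight'])

# ...yet both map their (distinct) parameters onto the same kvstore key:
print(trainer_a_keys, trainer_b_keys)  # [0] [0]
```

Because both trainers initialize and update key `0` on the shared distributed kvstore, the second trainer's traffic collides with the first one's.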
## Minimum reproducible example
Here is a simple example script that demonstrates this issue:
```python
import mxnet as mx
import numpy as np
from mxnet import autograd


def trainer_test(kvstore, problem_descr):
    # model
    m = mx.gluon.nn.Dense(1, use_bias=False)
    m.collect_params().initialize(mx.init.Constant(problem_descr['init']), ctx=mx.cpu())

    # trainer
    trainer = mx.gluon.Trainer(params=m.collect_params(),
                               optimizer='sgd',
                               optimizer_params={'learning_rate': problem_descr['lr']},
                               kvstore=kvstore)

    # update parameter
    with autograd.record():
        y = m(mx.nd.ones((1, 1)) * problem_descr['x'])
        loss_a = mx.nd.abs(problem_descr['target'] - y)
    loss_a.backward()
    trainer.step(1)

    # get new parameter value
    v = m.collect_params().get('weight').list_data()[0].asscalar()

    # get expected value ('dist_sync' sums the gradients of all workers)
    grad = np.sign(problem_descr['target'] - (problem_descr['init'] * problem_descr['x'])) * problem_descr['x']
    exp_v = problem_descr['init'] + grad * problem_descr['lr'] * kvstore.num_workers

    # print the updated parameter value and the expected value
    if kvstore.rank == 0:
        print(f'updated parameter value: {np.round(v, 3)}, expected value: {exp_v}')


if __name__ == '__main__':
    kv = mx.kv.create('dist_sync')
    trainer_test(kvstore=kv, problem_descr={'lr': 0.1, 'init': 1., 'x': 2, 'target': 4})
    trainer_test(kvstore=kv, problem_descr={'lr': 0.1, 'init': 3., 'x': 2, 'target': 4})
```
The `trainer_test` function performs a single-step update on a simplified regression problem. It takes as inputs:
- a distributed kvstore, and
- a problem description dict that defines a new regression problem to optimize.
## Steps to reproduce
Execute the script with MXNet's `launch.py` tool, using 2 or more workers and the `local` launcher:
```
launch.py -n 2 --launcher local python3 trainer_test.py
```
The script execution freezes after the first `trainer_test` call completes, because both calls create a trainer that uses the same keys on the shared distributed kvstore.
## What have you tried to solve it?
Replacing the integer kvstore keys in the trainer with unique parameter names (for example, `param.name`) resolves the issue.
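The workaround can be sketched in pure Python as well. This is a hypothetical helper, not MXNet's Trainer code: keying each parameter by its globally unique name (rather than by its enumeration index) guarantees that distinct trainers never share a kvstore key.

```python
# Hypothetical fix sketch: derive kvstore keys from unique parameter names
# (e.g. 'dense0_weight') instead of enumeration indices.
def trainer_keys_by_name(params):
    """Return name-based kvstore keys for these parameters."""
    return [name for name in params]

a = trainer_keys_by_name(['dense0_weight'])
b = trainer_keys_by_name(['dense1_weight'])

# The two trainers now use disjoint key sets, so their pushes/pulls on the
# shared distributed kvstore cannot collide:
print(a, b)  # ['dense0_weight'] ['dense1_weight']
```

Since Gluon parameter names are unique per block instance, this scheme keeps the two `trainer_test` calls from interfering with each other on the distributed kvstore.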