Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/10/31 05:34:15 UTC

[GitHub] varunrajk opened a new issue #13054: Multiple trainers within a single worker using a distributed KVStore

URL: https://github.com/apache/incubator-mxnet/issues/13054
 
 
   ## Description
   The Gluon Trainer `step` method uses enumeration indices as keys to push and pull gradients/parameters from the KVStore. Using two trainers within a single worker script (in a distributed training setting) therefore causes a problem: both trainers use the same set of keys on the shared distributed KVStore.
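   
   To make the mechanism concrete, here is a simplified sketch of where the colliding keys come from. It is illustrative only, not the actual MXNet Trainer source; the point is simply that the key is the parameter's enumeration index, so every trainer created on the same distributed KVStore reuses the keys 0, 1, 2, ...
   
    ```
    # Simplified sketch of how a Gluon Trainer registers its parameters with
    # the KVStore -- illustrative only, not the actual MXNet source.
    def sketch_init_kvstore(params, kvstore):
        for i, param in enumerate(params.values()):
            # The key is just the enumeration index i, so a second trainer
            # created on the same distributed KVStore initializes and pushes
            # to the identical keys 0, 1, 2, ...
            kvstore.init(i, param.data())
    ```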
   
   
   ## Minimum reproducible example
   Here is a simple example script that demonstrates this issue:
   
   ```
   import mxnet as mx
   import numpy as np
   from mxnet import autograd
   
   def trainer_test(kvstore, problem_descr):
       # model
       m = mx.gluon.nn.Dense(1, use_bias=False)
       m.collect_params().initialize(mx.init.Constant(problem_descr['init']), ctx=mx.cpu())
   
       # trainer
        trainer = mx.gluon.Trainer(params=m.collect_params(),
                                   optimizer='sgd',
                                   optimizer_params={'learning_rate': problem_descr['lr']},
                                   kvstore=kvstore)
   
       # update parameter
       with autograd.record():
           y = m(mx.nd.ones((1, 1)) * problem_descr['x'])
           loss_a = mx.nd.abs(problem_descr['target'] - y)
   
       loss_a.backward()
   
       trainer.step(1)
   
       # get new parameter value
       v = np.asscalar(m.collect_params().get('weight').list_data()[0].asnumpy())
   
       # get expected value
       grad = np.sign(problem_descr['target'] - (problem_descr['init'] * problem_descr['x'])) * problem_descr['x']
       exp_v = problem_descr['init'] + grad * problem_descr['lr'] * kvstore.num_workers
   
       # print the updated parameter value and the expected value
        if kvstore.rank == 0:
            print(f'updated parameter value: {np.round(v, 3)}, expected value: {exp_v}')
   
   if __name__ == '__main__':
       kv = mx.kv.create('dist_sync')
       trainer_test(kvstore=kv, problem_descr={'lr': 0.1, 'init': 1., 'x': 2, 'target': 4})
       trainer_test(kvstore=kv, problem_descr={'lr': 0.1, 'init': 3., 'x': 2, 'target': 4})
   
   ```
   The `trainer_test` function tests a single-step update on a simplified regression problem; a worked example of the expected values is given after the list below. It takes as inputs
   - a distributed kvstore, and
   - a problem description dict that defines a new regression problem to optimize.
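   
   As a concrete check of the arithmetic (assuming the script is launched with 2 workers): the first call computes y = 1 * 2 = 2, so grad = sign(4 - 2) * 2 = 2 and the expected value is 1 + 2 * 0.1 * 2 = 1.4; the second call should analogously give 3 + sign(4 - 6) * 2 * 0.1 * 2 = 2.6, but as described below it never completes.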
   
   ## Steps to reproduce
   
   Execute the script using MXNet's `launch.py` tool with 2 or more workers and the `local` launcher as follows:
   
   ```
   launch.py -n 2 --launcher local python3 trainer_test.py
   ```
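   (This assumes the example above has been saved as `trainer_test.py`; `launch.py` is the launcher script from the `tools` directory of the MXNet repository.)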
   The script execution freezes after evaluating the first `trainer_test` call. 
   
   ## What have you tried to solve it?
   
   Replacing the kvstore keys used by the trainer with unique parameter names (for example, `param.name`) solves this issue.
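   
   For illustration, here is a rough sketch of that kind of change (not a patch against the actual Trainer source): the parameter name is used as the KVStore key instead of the enumeration index. Gluon assigns each parameter a unique name (for example, `dense0_weight` and `dense1_weight` for the two models in the script above), so two trainers sharing a distributed KVStore would then operate on disjoint key sets.
   
    ```
    # Same sketch as in the description, but keyed by parameter name instead
    # of the enumeration index -- illustrative only, not the actual Trainer
    # code.
    def sketch_init_kvstore_by_name(params, kvstore):
        for param in params.values():
            # param.name is unique per parameter, so trainers that share a
            # distributed KVStore no longer collide on keys.
            kvstore.init(param.name, param.data())
    ```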
   
   
