Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2020/06/01 15:49:59 UTC

[GitHub] [singa] chrishkchris commented on pull request #697: New Model Layer Operator API

chrishkchris commented on pull request #697:
URL: https://github.com/apache/singa/pull/697#issuecomment-636934186


   1. Distributed training appears to work with the cnn model on mnist:
   
   ```
   root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 train.py cnn mnist
   Starting Epoch 0:
   Training loss = 578.907959, training accuracy = 0.796141
   Evaluation accuracy = 0.937400, Elapsed Time = 3.896953s
   Starting Epoch 1:
   Training loss = 232.124695, training accuracy = 0.922609
   Evaluation accuracy = 0.962841, Elapsed Time = 3.881301s
   Starting Epoch 2:
   Training loss = 167.437912, training accuracy = 0.944220
   Evaluation accuracy = 0.971855, Elapsed Time = 3.892491s
   Starting Epoch 3:
   Training loss = 138.634125, training accuracy = 0.953392
   Evaluation accuracy = 0.966747, Elapsed Time = 3.899462s
   Starting Epoch 4:
   Training loss = 117.458504, training accuracy = 0.961096
   Evaluation accuracy = 0.973057, Elapsed Time = 3.890943s
   Starting Epoch 5:
   Training loss = 104.992790, training accuracy = 0.965198
   Evaluation accuracy = 0.979267, Elapsed Time = 3.887677s
   Starting Epoch 6:
   Training loss = 96.263885, training accuracy = 0.967249
   Evaluation accuracy = 0.980369, Elapsed Time = 3.889482s
   Starting Epoch 7:
   Training loss = 89.073364, training accuracy = 0.970051
   Evaluation accuracy = 0.975561, Elapsed Time = 3.889685s
   Starting Epoch 8:
   Training loss = 82.311523, training accuracy = 0.972385
   Evaluation accuracy = 0.980369, Elapsed Time = 3.890139s
   Starting Epoch 9:
   Training loss = 78.408806, training accuracy = 0.974270
   Evaluation accuracy = 0.979968, Elapsed Time = 3.887619s
   root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 train_multiprocess.py cnn mnist --ws 8 --lr 0.04
   Starting Epoch 0:
   Training loss = 822.958374, training accuracy = 0.704260
   Evaluation accuracy = 0.920539, Elapsed Time = 0.963905s
   Starting Epoch 1:
   Training loss = 252.928589, training accuracy = 0.914830
   Evaluation accuracy = 0.959396, Elapsed Time = 0.795974s
   Starting Epoch 2:
   Training loss = 173.046478, training accuracy = 0.942291
   Evaluation accuracy = 0.961246, Elapsed Time = 0.759859s
   Starting Epoch 3:
   Training loss = 139.098495, training accuracy = 0.953309
   Evaluation accuracy = 0.971834, Elapsed Time = 0.755693s
   Starting Epoch 4:
   Training loss = 119.849213, training accuracy = 0.960270
   Evaluation accuracy = 0.976974, Elapsed Time = 0.732471s
   Starting Epoch 5:
   Training loss = 104.531982, training accuracy = 0.965595
   Evaluation accuracy = 0.976151, Elapsed Time = 0.737399s
   Starting Epoch 6:
   Training loss = 97.911720, training accuracy = 0.967698
   Evaluation accuracy = 0.976768, Elapsed Time = 0.752657s
   Starting Epoch 7:
   Training loss = 86.860199, training accuracy = 0.970019
   Evaluation accuracy = 0.976768, Elapsed Time = 0.787210s
   Starting Epoch 8:
   Training loss = 79.776062, training accuracy = 0.973641
   Evaluation accuracy = 0.980572, Elapsed Time = 0.755043s
   Starting Epoch 9:
   Training loss = 79.904083, training accuracy = 0.973741
   Evaluation accuracy = 0.980469, Elapsed Time = 0.762142s
   root@221941d9ee7f:~/dcsysh/singa/examples/cnn# mpiexec -np 8 python3 train_mpi.py cnn mnist --lr 0.04
   Starting Epoch 0:
   Training loss = 822.958374, training accuracy = 0.704260
   Evaluation accuracy = 0.920539, Elapsed Time = 0.724138s
   Starting Epoch 1:
   Training loss = 252.928589, training accuracy = 0.914830
   Evaluation accuracy = 0.959396, Elapsed Time = 0.668760s
   Starting Epoch 2:
   Training loss = 173.046478, training accuracy = 0.942291
   Evaluation accuracy = 0.961246, Elapsed Time = 0.664062s
   Starting Epoch 3:
   Training loss = 139.098495, training accuracy = 0.953309
   Evaluation accuracy = 0.971834, Elapsed Time = 0.672895s
   Starting Epoch 4:
   Training loss = 119.849213, training accuracy = 0.960270
   Evaluation accuracy = 0.976974, Elapsed Time = 0.673973s
   Starting Epoch 5:
   Training loss = 104.531982, training accuracy = 0.965595
   Evaluation accuracy = 0.976151, Elapsed Time = 0.673889s
   Starting Epoch 6:
   Training loss = 97.911720, training accuracy = 0.967698
   Evaluation accuracy = 0.976768, Elapsed Time = 0.688231s
   Starting Epoch 7:
   Training loss = 86.860199, training accuracy = 0.970019
   Evaluation accuracy = 0.976768, Elapsed Time = 0.703752s
   Starting Epoch 8:
   Training loss = 79.776062, training accuracy = 0.973641
   Evaluation accuracy = 0.980572, Elapsed Time = 0.687812s
   Starting Epoch 9:
   Training loss = 79.904083, training accuracy = 0.973741
   Evaluation accuracy = 0.980469, Elapsed Time = 0.698002s
   ```
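
To put the three runs above side by side, here is a rough sketch that averages the per-epoch elapsed times copied from the logs and compares them. It assumes `--ws 8` and `-np 8` both mean 8 workers, and it does not discount the first epoch, which may include warm-up cost. The list and function names are only illustrative.

```python
from statistics import mean

# Per-epoch "Elapsed Time" values (seconds) copied from the logs above.
single = [3.896953, 3.881301, 3.892491, 3.899462, 3.890943,
          3.887677, 3.889482, 3.889685, 3.890139, 3.887619]
multiprocess = [0.963905, 0.795974, 0.759859, 0.755693, 0.732471,
                0.737399, 0.752657, 0.787210, 0.755043, 0.762142]
mpi = [0.724138, 0.668760, 0.664062, 0.672895, 0.673973,
       0.673889, 0.688231, 0.703752, 0.687812, 0.698002]


def speedup(baseline, candidate):
    """Ratio of mean epoch time: baseline run over candidate run."""
    return mean(baseline) / mean(candidate)


print(f"train_multiprocess.py vs train.py: {speedup(single, multiprocess):.2f}x")
print(f"train_mpi.py vs train.py:          {speedup(single, mpi):.2f}x")
```

Note that the two distributed runs report identical loss and accuracy figures for every epoch, so the multiprocess and MPI launchers appear to produce the same training trajectory and differ only in elapsed time.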
   
   2. However, running benchmark.py fails with a TypeError:
   ```
   root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 benchmark.py
     0%|                                                                                                              | 0/100 [00:00<?, ?it/s]
   Traceback (most recent call last):
     File "benchmark.py", line 111, in <module>
       train_resnet(DIST=args.DIST, graph=args.graph)
     File "benchmark.py", line 78, in train_resnet
       out = model(tx)
     File "/root/dcsysh/singa/build/python/singa/model.py", line 203, in __call__
       return self.train_one_batch(*input, **kwargs)
     File "/root/dcsysh/singa/build/python/singa/model.py", line 49, in wrapper
       self._results = func(self, *args, **kwargs)
   TypeError: train_one_batch() missing 3 required positional arguments: 'y', 'dist_option', and 'spars'
   ```
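
Reading the traceback, `model(tx)` goes through `__call__`, which forwards its arguments unchanged to `train_one_batch`, so the call fails because only `x` is supplied. The sketch below reproduces that pattern without SINGA. `ToyModel`, `graph_wrapper`, and the argument values are hypothetical stand-ins, not the actual code in `model.py` or `benchmark.py`.

```python
import functools


def graph_wrapper(func):
    # Hypothetical stand-in for the wrapper seen at model.py line 49.
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self._results = func(self, *args, **kwargs)
        return self._results
    return wrapper


class ToyModel:
    def __call__(self, *inputs, **kwargs):
        # Forward everything as-is, as the traceback suggests model.py does.
        return self.train_one_batch(*inputs, **kwargs)

    @graph_wrapper
    def train_one_batch(self, x, y, dist_option, spars):
        # Same parameter names as in the error message above.
        return ("out", "loss")


model = ToyModel()

try:
    model("tx")  # only x is passed, mirroring `out = model(tx)`
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Passing every required argument succeeds (placeholder values).
result = model("tx", "ty", "fp32", None)
print(result)
```

If this reading is right, either the call in benchmark.py needs to pass `y`, `dist_option`, and `spars`, or `train_one_batch` needs defaults for the arguments the benchmark does not use.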


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org