Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2020/06/01 15:49:59 UTC
[GitHub] [singa] chrishkchris commented on pull request #697: New Model Layer Operator API
URL: https://github.com/apache/singa/pull/697#issuecomment-636934186
1. Training seems to be okay with cnn mnist, both single-process (`train.py`) and distributed (`train_multiprocess.py`, `train_mpi.py`):
```
root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 train.py cnn mnist
Starting Epoch 0:
Training loss = 578.907959, training accuracy = 0.796141
Evaluation accuracy = 0.937400, Elapsed Time = 3.896953s
Starting Epoch 1:
Training loss = 232.124695, training accuracy = 0.922609
Evaluation accuracy = 0.962841, Elapsed Time = 3.881301s
Starting Epoch 2:
Training loss = 167.437912, training accuracy = 0.944220
Evaluation accuracy = 0.971855, Elapsed Time = 3.892491s
Starting Epoch 3:
Training loss = 138.634125, training accuracy = 0.953392
Evaluation accuracy = 0.966747, Elapsed Time = 3.899462s
Starting Epoch 4:
Training loss = 117.458504, training accuracy = 0.961096
Evaluation accuracy = 0.973057, Elapsed Time = 3.890943s
Starting Epoch 5:
Training loss = 104.992790, training accuracy = 0.965198
Evaluation accuracy = 0.979267, Elapsed Time = 3.887677s
Starting Epoch 6:
Training loss = 96.263885, training accuracy = 0.967249
Evaluation accuracy = 0.980369, Elapsed Time = 3.889482s
Starting Epoch 7:
Training loss = 89.073364, training accuracy = 0.970051
Evaluation accuracy = 0.975561, Elapsed Time = 3.889685s
Starting Epoch 8:
Training loss = 82.311523, training accuracy = 0.972385
Evaluation accuracy = 0.980369, Elapsed Time = 3.890139s
Starting Epoch 9:
Training loss = 78.408806, training accuracy = 0.974270
Evaluation accuracy = 0.979968, Elapsed Time = 3.887619s
root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 train_multiprocess.py cnn mnist --ws 8 --lr 0.04
Starting Epoch 0:
Training loss = 822.958374, training accuracy = 0.704260
Evaluation accuracy = 0.920539, Elapsed Time = 0.963905s
Starting Epoch 1:
Training loss = 252.928589, training accuracy = 0.914830
Evaluation accuracy = 0.959396, Elapsed Time = 0.795974s
Starting Epoch 2:
Training loss = 173.046478, training accuracy = 0.942291
Evaluation accuracy = 0.961246, Elapsed Time = 0.759859s
Starting Epoch 3:
Training loss = 139.098495, training accuracy = 0.953309
Evaluation accuracy = 0.971834, Elapsed Time = 0.755693s
Starting Epoch 4:
Training loss = 119.849213, training accuracy = 0.960270
Evaluation accuracy = 0.976974, Elapsed Time = 0.732471s
Starting Epoch 5:
Training loss = 104.531982, training accuracy = 0.965595
Evaluation accuracy = 0.976151, Elapsed Time = 0.737399s
Starting Epoch 6:
Training loss = 97.911720, training accuracy = 0.967698
Evaluation accuracy = 0.976768, Elapsed Time = 0.752657s
Starting Epoch 7:
Training loss = 86.860199, training accuracy = 0.970019
Evaluation accuracy = 0.976768, Elapsed Time = 0.787210s
Starting Epoch 8:
Training loss = 79.776062, training accuracy = 0.973641
Evaluation accuracy = 0.980572, Elapsed Time = 0.755043s
Starting Epoch 9:
Training loss = 79.904083, training accuracy = 0.973741
Evaluation accuracy = 0.980469, Elapsed Time = 0.762142s
root@221941d9ee7f:~/dcsysh/singa/examples/cnn# mpiexec -np 8 python3 train_mpi.py cnn mnist --lr 0.04
Starting Epoch 0:
Training loss = 822.958374, training accuracy = 0.704260
Evaluation accuracy = 0.920539, Elapsed Time = 0.724138s
Starting Epoch 1:
Training loss = 252.928589, training accuracy = 0.914830
Evaluation accuracy = 0.959396, Elapsed Time = 0.668760s
Starting Epoch 2:
Training loss = 173.046478, training accuracy = 0.942291
Evaluation accuracy = 0.961246, Elapsed Time = 0.664062s
Starting Epoch 3:
Training loss = 139.098495, training accuracy = 0.953309
Evaluation accuracy = 0.971834, Elapsed Time = 0.672895s
Starting Epoch 4:
Training loss = 119.849213, training accuracy = 0.960270
Evaluation accuracy = 0.976974, Elapsed Time = 0.673973s
Starting Epoch 5:
Training loss = 104.531982, training accuracy = 0.965595
Evaluation accuracy = 0.976151, Elapsed Time = 0.673889s
Starting Epoch 6:
Training loss = 97.911720, training accuracy = 0.967698
Evaluation accuracy = 0.976768, Elapsed Time = 0.688231s
Starting Epoch 7:
Training loss = 86.860199, training accuracy = 0.970019
Evaluation accuracy = 0.976768, Elapsed Time = 0.703752s
Starting Epoch 8:
Training loss = 79.776062, training accuracy = 0.973641
Evaluation accuracy = 0.980572, Elapsed Time = 0.687812s
Starting Epoch 9:
Training loss = 79.904083, training accuracy = 0.973741
Evaluation accuracy = 0.980469, Elapsed Time = 0.698002s
```
2. However, running `benchmark.py` raises an error:
```
root@221941d9ee7f:~/dcsysh/singa/examples/cnn# python3 benchmark.py
0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "benchmark.py", line 111, in <module>
    train_resnet(DIST=args.DIST, graph=args.graph)
  File "benchmark.py", line 78, in train_resnet
    out = model(tx)
  File "/root/dcsysh/singa/build/python/singa/model.py", line 203, in __call__
    return self.train_one_batch(*input, **kwargs)
  File "/root/dcsysh/singa/build/python/singa/model.py", line 49, in wrapper
    self._results = func(self, *args, **kwargs)
TypeError: train_one_batch() missing 3 required positional arguments: 'y', 'dist_option', and 'spars'
```
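The traceback suggests that `model(tx)` is forwarded unchanged to `train_one_batch`, which expects `x`, `y`, `dist_option`, and `spars`. A minimal sketch of that call path, without SINGA, is below. `ToyModel` and `record_results` are hypothetical stand-ins that only mirror the two `model.py` frames in the traceback; they are not the real `singa.model.Model`, and the argument values are placeholders.

```python
import functools


def record_results(func):
    # Hypothetical stand-in for the wrapper frame at model.py line 49:
    # it stores whatever the wrapped method returns.
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self._results = func(self, *args, **kwargs)
        return self._results
    return wrapper


class ToyModel:
    """Hypothetical stand-in; NOT the real singa.model.Model."""

    @record_results
    def train_one_batch(self, x, y, dist_option, spars):
        # Same parameter names as in the TypeError message.
        return x, y, dist_option, spars

    def __call__(self, *inputs, **kwargs):
        # Mirrors the frame at model.py line 203: arguments are passed through as-is.
        return self.train_one_batch(*inputs, **kwargs)


model = ToyModel()

# Calling with only the input reproduces the same kind of TypeError.
try:
    model("tx")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Supplying all four arguments (placeholder values) goes through.
result = model("tx", "ty", "placeholder-dist-option", None)
print(result)
```

If this reading is right, either `benchmark.py` would need to pass the remaining three arguments in its `model(tx)` call, or its model's `train_one_batch` would need a signature that matches how it is called.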