Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2020/07/09 09:38:37 UTC
[GitHub] [singa] chrishkchris opened a new pull request #762: Fix training loss error
chrishkchris opened a new pull request #762:
URL: https://github.com/apache/singa/pull/762
This fixes an error in the training loss: the expected loss values I have relied on for a long time appear wrong on the dev branch in distributed training.
Before fix:
```
root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
Starting Epoch 0:
Training loss = 867.269531, training accuracy = 0.682409
Evaluation accuracy = 0.913662, Elapsed Time = 1.374367s
Starting Epoch 1:
Training loss = 312.582123, training accuracy = 0.893546
Evaluation accuracy = 0.946014, Elapsed Time = 1.324747s
Starting Epoch 2:
Training loss = 223.973038, training accuracy = 0.924312
Evaluation accuracy = 0.955629, Elapsed Time = 1.325152s
Starting Epoch 3:
Training loss = 176.310730, training accuracy = 0.939804
Evaluation accuracy = 0.965645, Elapsed Time = 1.327019s
Starting Epoch 4:
Training loss = 146.806168, training accuracy = 0.950220
Evaluation accuracy = 0.969451, Elapsed Time = 1.320603s
Starting Epoch 5:
Training loss = 124.658463, training accuracy = 0.958784
Evaluation accuracy = 0.970653, Elapsed Time = 1.317975s
Starting Epoch 6:
Training loss = 112.322250, training accuracy = 0.962724
Evaluation accuracy = 0.972857, Elapsed Time = 1.343767s
Starting Epoch 7:
Training loss = 102.903122, training accuracy = 0.965044
Evaluation accuracy = 0.971254, Elapsed Time = 1.316032s
Starting Epoch 8:
Training loss = 96.206215, training accuracy = 0.967798
Evaluation accuracy = 0.971354, Elapsed Time = 1.292748s
Starting Epoch 9:
Training loss = 90.059357, training accuracy = 0.969785
Evaluation accuracy = 0.981170, Elapsed Time = 1.301958s
```
After fix:
```
root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
Starting Epoch 0:
Training loss = 653.234863, training accuracy = 0.767194
Evaluation accuracy = 0.936498, Elapsed Time = 1.364626s
Starting Epoch 1:
Training loss = 245.488037, training accuracy = 0.917201
Evaluation accuracy = 0.959435, Elapsed Time = 1.311175s
Starting Epoch 2:
Training loss = 174.001266, training accuracy = 0.941757
Evaluation accuracy = 0.959736, Elapsed Time = 1.324813s
Starting Epoch 3:
Training loss = 141.203125, training accuracy = 0.953292
Evaluation accuracy = 0.971054, Elapsed Time = 1.330215s
Starting Epoch 4:
Training loss = 119.192688, training accuracy = 0.959519
Evaluation accuracy = 0.973758, Elapsed Time = 1.302892s
Starting Epoch 5:
Training loss = 107.171661, training accuracy = 0.964443
Evaluation accuracy = 0.975761, Elapsed Time = 1.314337s
Starting Epoch 6:
Training loss = 97.575897, training accuracy = 0.966513
Evaluation accuracy = 0.977764, Elapsed Time = 1.304296s
Starting Epoch 7:
Training loss = 89.828827, training accuracy = 0.970753
Evaluation accuracy = 0.975561, Elapsed Time = 1.316111s
Starting Epoch 8:
Training loss = 84.263199, training accuracy = 0.972189
Evaluation accuracy = 0.979868, Elapsed Time = 1.298452s
Starting Epoch 9:
Training loss = 78.318733, training accuracy = 0.974059
Evaluation accuracy = 0.981370, Elapsed Time = 1.308062s
```
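To compare the two runs numerically rather than by eye, the training-loss values can be pulled out of the log text. The following is a minimal sketch, not part of the PR: the helper name `training_losses` is hypothetical, and only the Epoch 0 and Epoch 9 lines from the logs above are pasted in to keep it short.

```python
import re

# Matches the "Training loss = <number>" part of a train_mpi.py log line.
LOSS_RE = re.compile(r"Training loss = ([0-9.]+)")


def training_losses(log_text):
    """Return the training-loss values found in the log text, in order."""
    return [float(value) for value in LOSS_RE.findall(log_text)]


# Epoch 0 and Epoch 9 lines copied from the "Before fix" log.
before = """
Training loss = 867.269531, training accuracy = 0.682409
Training loss = 90.059357, training accuracy = 0.969785
"""

# Epoch 0 and Epoch 9 lines copied from the "After fix" log.
after = """
Training loss = 653.234863, training accuracy = 0.767194
Training loss = 78.318733, training accuracy = 0.974059
"""

for b, a in zip(training_losses(before), training_losses(after)):
    # Relative change of the loss after the fix, as a signed percentage.
    print(f"before={b:.6f} after={a:.6f} change={(a - b) / b:+.1%}")
```

Pasting the full logs into `before` and `after` gives the same comparison for every epoch.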
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [singa] dcslin merged pull request #762: Fix training loss error
Posted by GitBox <gi...@apache.org>.
dcslin merged pull request #762:
URL: https://github.com/apache/singa/pull/762
[GitHub] [singa] dcslin commented on pull request #762: Fix training loss error
Posted by GitBox <gi...@apache.org>.
dcslin commented on pull request #762:
URL: https://github.com/apache/singa/pull/762#issuecomment-656140942
Yes, I reviewed softmax cross entropy; it automatically detects whether its input is "from_logits". This looks OK to me.
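For readers unfamiliar with the "from_logits" distinction, here is a minimal NumPy sketch of a softmax cross entropy that accepts either raw logits or probabilities. It is not SINGA's implementation: the function names and the detection heuristic (`looks_like_probabilities`) are assumptions made up for illustration, and SINGA may detect the case differently.

```python
import numpy as np


def looks_like_probabilities(x, atol=1e-6):
    """Hypothetical heuristic: every row is non-negative and sums to 1."""
    return bool(np.all(x >= 0) and np.allclose(x.sum(axis=1), 1.0, atol=atol))


def softmax_cross_entropy(x, labels):
    """Mean cross entropy for integer class labels.

    x has shape (batch, classes) and holds either raw logits or
    softmax probabilities; which one is guessed by the heuristic above.
    """
    x = np.asarray(x, dtype=np.float64)
    labels = np.asarray(labels)
    if looks_like_probabilities(x):
        # Input is already normalized; clip to avoid log(0).
        log_probs = np.log(np.clip(x, 1e-12, None))
    else:
        # Numerically stable log-softmax: subtract the row max first.
        shifted = x - x.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

loss_from_logits = softmax_cross_entropy(logits, labels)
loss_from_probs = softmax_cross_entropy(probs, labels)
print(loss_from_logits, loss_from_probs)
```

Both calls should give the same loss up to rounding. A heuristic like this can misfire on logits that happen to be non-negative and sum to 1 per row, which is why an explicit flag is often preferred over detection.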