Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2020/07/09 09:38:37 UTC

[GitHub] [singa] chrishkchris opened a new pull request #762: Fix training loss error

chrishkchris opened a new pull request #762:
URL: https://github.com/apache/singa/pull/762


   Fixes an error in the training loss. In distributed training on the dev branch, the loss no longer matched the expected values I have used as a reference for a long time.
   
   Before fix:
   ```
   root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
   Starting Epoch 0:
   Training loss = 867.269531, training accuracy = 0.682409
   Evaluation accuracy = 0.913662, Elapsed Time = 1.374367s
   Starting Epoch 1:
   Training loss = 312.582123, training accuracy = 0.893546
   Evaluation accuracy = 0.946014, Elapsed Time = 1.324747s
   Starting Epoch 2:
   Training loss = 223.973038, training accuracy = 0.924312
   Evaluation accuracy = 0.955629, Elapsed Time = 1.325152s
   Starting Epoch 3:
   Training loss = 176.310730, training accuracy = 0.939804
   Evaluation accuracy = 0.965645, Elapsed Time = 1.327019s
   Starting Epoch 4:
   Training loss = 146.806168, training accuracy = 0.950220
   Evaluation accuracy = 0.969451, Elapsed Time = 1.320603s
   Starting Epoch 5:
   Training loss = 124.658463, training accuracy = 0.958784
   Evaluation accuracy = 0.970653, Elapsed Time = 1.317975s
   Starting Epoch 6:
   Training loss = 112.322250, training accuracy = 0.962724
   Evaluation accuracy = 0.972857, Elapsed Time = 1.343767s
   Starting Epoch 7:
   Training loss = 102.903122, training accuracy = 0.965044
   Evaluation accuracy = 0.971254, Elapsed Time = 1.316032s
   Starting Epoch 8:
   Training loss = 96.206215, training accuracy = 0.967798
   Evaluation accuracy = 0.971354, Elapsed Time = 1.292748s
   Starting Epoch 9:
   Training loss = 90.059357, training accuracy = 0.969785
   Evaluation accuracy = 0.981170, Elapsed Time = 1.301958s
   ```
   
   After fix:
   ```
   root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
   Starting Epoch 0:
   Training loss = 653.234863, training accuracy = 0.767194
   Evaluation accuracy = 0.936498, Elapsed Time = 1.364626s
   Starting Epoch 1:
   Training loss = 245.488037, training accuracy = 0.917201
   Evaluation accuracy = 0.959435, Elapsed Time = 1.311175s
   Starting Epoch 2:
   Training loss = 174.001266, training accuracy = 0.941757
   Evaluation accuracy = 0.959736, Elapsed Time = 1.324813s
   Starting Epoch 3:
   Training loss = 141.203125, training accuracy = 0.953292
   Evaluation accuracy = 0.971054, Elapsed Time = 1.330215s
   Starting Epoch 4:
   Training loss = 119.192688, training accuracy = 0.959519
   Evaluation accuracy = 0.973758, Elapsed Time = 1.302892s
   Starting Epoch 5:
   Training loss = 107.171661, training accuracy = 0.964443
   Evaluation accuracy = 0.975761, Elapsed Time = 1.314337s
   Starting Epoch 6:
   Training loss = 97.575897, training accuracy = 0.966513
   Evaluation accuracy = 0.977764, Elapsed Time = 1.304296s
   Starting Epoch 7:
   Training loss = 89.828827, training accuracy = 0.970753
   Evaluation accuracy = 0.975561, Elapsed Time = 1.316111s
   Starting Epoch 8:
   Training loss = 84.263199, training accuracy = 0.972189
   Evaluation accuracy = 0.979868, Elapsed Time = 1.298452s
   Starting Epoch 9:
   Training loss = 78.318733, training accuracy = 0.974059
   Evaluation accuracy = 0.981370, Elapsed Time = 1.308062s
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [singa] dcslin merged pull request #762: Fix training loss error

Posted by GitBox <gi...@apache.org>.
dcslin merged pull request #762:
URL: https://github.com/apache/singa/pull/762


   





[GitHub] [singa] dcslin commented on pull request #762: Fix training loss error

Posted by GitBox <gi...@apache.org>.
dcslin commented on pull request #762:
URL: https://github.com/apache/singa/pull/762#issuecomment-656140942


   Yes, I reviewed the softmax cross entropy; it automatically detects whether it is "from_logits". This looks OK to me.
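
For readers unfamiliar with the "from_logits" distinction mentioned above, here is a minimal sketch of why it matters for the reported loss. It is plain Python and does not use SINGA; the helper names `softmax` and `cross_entropy` and the explicit `from_logits` flag are assumptions made for illustration, not SINGA's API or its detection logic. The point is only that applying softmax to values that are already probabilities changes the loss.

```python
import math


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(values, target, from_logits):
    # Hypothetical helper (not SINGA's API): `values` holds one sample's
    # class scores and `target` is the index of the true class.
    probs = softmax(values) if from_logits else values
    return -math.log(probs[target])


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Same loss whether we start from logits or from their softmax output,
# as long as the flag matches the input.
loss_from_logits = cross_entropy(logits, 0, from_logits=True)
loss_from_probs = cross_entropy(probs, 0, from_logits=False)

# Mislabeling probabilities as logits applies softmax twice and
# yields a different (here larger) loss.
loss_double_softmax = cross_entropy(probs, 0, from_logits=True)

print(loss_from_logits, loss_from_probs, loss_double_softmax)
```

If the flag and the input disagree, the per-sample loss is skewed, which is one way a summed training loss can drift from its expected value. Whether that was the cause in this PR is not stated in the thread.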

