Posted to commits@submarine.apache.org by GitBox <gi...@apache.org> on 2021/08/03 15:39:15 UTC

[GitHub] [submarine] FatalLin opened a new pull request #694: SUBMARINE-952. add backoffLimit and Failed status for experiment object

FatalLin opened a new pull request #694:
URL: https://github.com/apache/submarine/pull/694


   ### What is this PR for?
   As mentioned in the Jira ticket, Submarine currently retries retryable jobs endlessly, even when those jobs have no chance of succeeding. This wastes resources, so I added an MLJob property, BackoffLimit, to prevent this situation. At the same time, I changed MLJobSpec from an interface into an abstract class so that the property is shared with TFJobSpec and PytorchJobSpec.
   I also fixed a bug so that the correct experiment status is returned in the failure case.
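
To illustrate the interface-to-abstract-class change described above, here is a minimal sketch. The class names come from this description, but the field type, accessor names, and the `BackoffLimitSketch` driver are assumptions made for illustration and may not match the actual Submarine source.

```java
// Illustrative sketch only: class names follow the PR description, but the
// bodies are assumptions, not the actual Submarine source.
class BackoffLimitSketch {
    public static void main(String[] args) {
        MLJobSpec tfSpec = new TFJobSpec();
        MLJobSpec pytorchSpec = new PytorchJobSpec();

        // Both subclasses inherit the same property from the abstract base.
        tfSpec.setBackoffLimit(3);
        pytorchSpec.setBackoffLimit(5);

        System.out.println(tfSpec.getBackoffLimit());      // prints 3
        System.out.println(pytorchSpec.getBackoffLimit()); // prints 5
    }
}

// As an interface, MLJobSpec could not hold state; as an abstract class it can
// carry the shared backoffLimit field once for every job type.
abstract class MLJobSpec {
    // Assumed to be nullable so that "not set" can be told apart from 0.
    private Integer backoffLimit;

    Integer getBackoffLimit() {
        return backoffLimit;
    }

    void setBackoffLimit(Integer backoffLimit) {
        this.backoffLimit = backoffLimit;
    }
}

// Framework-specific fields are omitted; only the shared property is shown.
class TFJobSpec extends MLJobSpec {
}

class PytorchJobSpec extends MLJobSpec {
}
```

Holding the field in the base class means each job type does not need its own copy of the property and its accessors.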
   
   ### What type of PR is it?
   Improvement
   
   ### Todos
   N/A
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/SUBMARINE-952
   
   ### How should this be tested?
   Modify the test case (https://github.com/apache/submarine/blob/master/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/experimentIT.java#L90) from {1024, 1024} to {512, 512}.
   The experiment will then hit OOMFailure, and its status will change to failed after 3 retries.
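
The expected behaviour in this test (failed status after 3 retries) can be modelled with a small simulation. This is a sketch under stated assumptions: `RetrySimulator`, `RetryOutcome`, and `SketchStatus` are hypothetical names, and the counting rule (one initial attempt plus at most `backoffLimit` retries) is an assumption; the real retry accounting is done by the underlying job operator, not by code like this.

```java
import java.util.function.IntPredicate;

// Illustrative simulation only; none of these types exist in Submarine.
class RetrySketch {
    public static void main(String[] args) {
        // A job that fails on every attempt, like the OOM case in the test.
        RetryOutcome outcome = RetrySimulator.run(3, attempt -> false);
        System.out.println(outcome.status + " after " + outcome.retries + " retries");
        // prints "FAILED after 3 retries"
    }
}

enum SketchStatus { SUCCEEDED, FAILED }

final class RetryOutcome {
    final SketchStatus status;
    final int retries;

    RetryOutcome(SketchStatus status, int retries) {
        this.status = status;
        this.retries = retries;
    }
}

final class RetrySimulator {
    // attemptSucceeds receives the number of retries used so far (0 for the
    // initial attempt) and reports whether that attempt succeeds.
    static RetryOutcome run(int backoffLimit, IntPredicate attemptSucceeds) {
        int retries = 0;
        while (true) {
            if (attemptSucceeds.test(retries)) {
                return new RetryOutcome(SketchStatus.SUCCEEDED, retries);
            }
            // Assumed rule: stop once the retry budget is used up instead of
            // retrying endlessly.
            if (retries >= backoffLimit) {
                return new RetryOutcome(SketchStatus.FAILED, retries);
            }
            retries++;
        }
    }
}
```

Without the limit check, the loop would never terminate for a job that always fails, which is the endless-retry situation this PR addresses.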
   ### Screenshots (if appropriate)
   <img width="1380" alt="Screenshot 2021-08-01 5:03:43 PM" src="https://user-images.githubusercontent.com/5687317/128044592-e2cee95c-2ee9-4702-88ff-d41950e003ec.png">
   <img width="1394" alt="Screenshot 2021-08-03 11:10:39 PM" src="https://user-images.githubusercontent.com/5687317/128044618-454afdb5-c1b8-4395-a75e-f470a7c41625.png">
   
   ### Questions:
   * Do the license files need updating? No
   * Are there breaking changes for older versions? No
   * Does this need new documentation? No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@submarine.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [submarine] asfgit closed pull request #694: SUBMARINE-952. add backoffLimit and Failed status for experiment object

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #694:
URL: https://github.com/apache/submarine/pull/694


   

