Posted to commits@submarine.apache.org by GitBox <gi...@apache.org> on 2021/08/06 07:40:26 UTC

[GitHub] [submarine] kevin85421 opened a new pull request #699: SUBMARINE-948. Allow experiments to overcommit memory

kevin85421 opened a new pull request #699:
URL: https://github.com/apache/submarine/pull/699


   ### What is this PR for?
   The following two pull requests aim to resolve Out-Of-Memory (OOM) errors. However, it is hard for users to predict the actual memory usage of an experiment. Using the Kubernetes memory request and memory limit mechanism to allow memory overcommitment is therefore helpful for users.
   
   * https://github.com/apache/submarine/pull/621
   * https://github.com/apache/submarine/pull/510
   
   In this PR, I set the memory limit to twice the memory request to enable memory overcommitment. With this patch, a container can use more memory than it requested (up to the limit) when the node has memory available, which makes OOM kills less likely.
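   
   The rule above can be illustrated with a minimal sketch. This is not the code changed by this PR: the class `MemoryOvercommit`, its method names, and the plain-map representation of the resource spec are all hypothetical, and the sketch assumes the memory quantity is an integer followed by an optional unit suffix such as `M`.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Hypothetical helper for illustration only; not taken from the Submarine code base.
   final class MemoryOvercommit {
       // The limit is twice the request, as described in this PR.
       static final long LIMIT_FACTOR = 2L;

       // Assumed input format: an integer amount plus an optional unit suffix, e.g. "512M".
       private static final Pattern QUANTITY = Pattern.compile("(\\d+)([A-Za-z]*)");

       private MemoryOvercommit() {}

       // Returns the memory limit for a given request, keeping the unit suffix unchanged.
       static String limitFor(String memoryRequest) {
           Matcher m = QUANTITY.matcher(memoryRequest.trim());
           if (!m.matches()) {
               throw new IllegalArgumentException("Unsupported memory quantity: " + memoryRequest);
           }
           long amount = Math.multiplyExact(Long.parseLong(m.group(1)), LIMIT_FACTOR);
           return amount + m.group(2);
       }

       // Builds the "requests" and "limits" memory entries of a container resource spec.
       static Map<String, Map<String, String>> memoryResources(String memoryRequest) {
           Map<String, String> requests = new LinkedHashMap<>();
           requests.put("memory", memoryRequest);

           Map<String, String> limits = new LinkedHashMap<>();
           limits.put("memory", limitFor(memoryRequest));

           Map<String, Map<String, String>> resources = new LinkedHashMap<>();
           resources.put("requests", requests);
           resources.put("limits", limits);
           return resources;
       }
   }
   ```
   
   Under these assumptions, `MemoryOvercommit.memoryResources("512M")` yields a request of `512M` and a limit of `1024M`. The scheduler places the pod based on the request, while the container is only OOM-killed for exceeding the limit.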
   
   This [article](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits) is a good resource to better understand this PR.
   
   ### What type of PR is it?
   [Feature]
   
   ### Todos
   
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/SUBMARINE-948
   
   ### How should this be tested?
   **Test1**
   * Create a distributed TensorFlow MNIST job, and set the memory quota of a worker to 512 MB. Specifically, change [experimentIT.java:90](https://github.com/apache/submarine/blob/master/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/experimentIT.java#L90) to:
     ```java
     experimentPage.fillTfSpec(2, new String[]{"Ps", "Worker"}, new int[]{1, 1}, new int[]{1, 1}, new int[]{512, 512});
     ```
   * Without this PR, the MNIST job is killed due to an Out-Of-Memory error. With this PR, the MNIST job is not killed.
   
   **Test2**
   Describe the experiment pod and check that its memory limit is twice its memory request:
   ```
   kubectl describe pod ${your_experiment_pod}
   ```
   <img width="422" alt="Screenshot 2021-08-06 2:40:42 PM" src="https://user-images.githubusercontent.com/20109646/128474314-bcfc0067-a841-4bdb-8ce2-4014849ffd57.png">
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * Do the license files need updating? No
   * Are there breaking changes for older versions? No
   * Does this need new documentation? No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@submarine.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [submarine] asfgit closed pull request #699: SUBMARINE-948. Allow experiments to overcommit memory

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #699:
URL: https://github.com/apache/submarine/pull/699


   




[GitHub] [submarine] kevin85421 commented on pull request #699: SUBMARINE-948. Allow experiments to overcommit memory

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on pull request #699:
URL: https://github.com/apache/submarine/pull/699#issuecomment-894069575


   @pingsutw Can you help me review this PR? Thanks!
   

