Posted to dev@submarine.apache.org by pi...@apache.org on 2021/08/07 04:36:56 UTC

[submarine] branch master updated: SUBMARINE-948. Allow experiments to overcommit memory

This is an automated email from the ASF dual-hosted git repository.

pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git


The following commit(s) were added to refs/heads/master by this push:
     new 39b65f1  SUBMARINE-948. Allow experiments to overcommit memory
39b65f1 is described below

commit 39b65f1f9aad9a6ff25fc14506267a7b7fcb6b8e
Author: Kai-Hsun Chen <b0...@ntu.edu.tw>
AuthorDate: Fri Aug 6 15:01:41 2021 +0800

    SUBMARINE-948. Allow experiments to overcommit memory
    
    ### What is this PR for?
    The following two pull requests aimed to resolve Out-Of-Memory (OOM) errors. However, users cannot easily predict the actual memory usage of an experiment. This PR therefore uses the Kubernetes memory request and memory limit mechanism to let experiments overcommit memory.
    
    * https://github.com/apache/submarine/pull/621
    * https://github.com/apache/submarine/pull/510
    
    This PR sets the memory limit to twice the memory request, which enables memory overcommitment. With this patch, experiments are less likely to be killed by OOM errors.
    
    This [article](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits) is a good resource to better understand this PR.
    
    ### What type of PR is it?
    [Feature]
    
    ### Todos
    
    ### What is the Jira issue?
    https://issues.apache.org/jira/browse/SUBMARINE-948
    
    ### How should this be tested?
    **Test1**
    * Create a distributed TensorFlow MNIST job and set the memory quota of a worker to 512 MB. Specifically, change [experimentIT.java:90](https://github.com/apache/submarine/blob/master/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/experimentIT.java#L90) to:
      ```java
      experimentPage.fillTfSpec(2, new String[]{"Ps", "Worker"}, new int[]{1, 1}, new int[]{1, 1}, new int[]{512, 512});
      ```
    * Without this PR, the MNIST job is killed by an Out-Of-Memory error. With this PR, the job is not killed.
    
    **Test2**
    * Verify that the memory limit of the experiment pod equals twice its memory request:
        ```
        kubectl describe pod ${your_experiment_pod}
        ```
        <img width="422" alt="Screenshot 2021-08-06 2 40 42 PM" src="https://user-images.githubusercontent.com/20109646/128474314-bcfc0067-a841-4bdb-8ce2-4014849ffd57.png">
    
    ### Screenshots (if appropriate)
    * Kubernetes integration test on my local machine
    <img width="1393" alt="Screenshot 2021-08-06 3 53 50 PM" src="https://user-images.githubusercontent.com/20109646/128476758-1918d7e3-d17c-4c37-b1d1-33baee71488b.png">
    
    ### Questions:
    * Do the license files need updating? No
    * Are there breaking changes for older versions? No
    * Does this need new documentation? No
    
    Author: Kai-Hsun Chen <b0...@ntu.edu.tw>
    
    Signed-off-by: Kevin <pi...@apache.org>
    
    Closes #699 from kevin85421/SUBMARINE-948 and squashes the following commits:
    
    0043baa4 [Kai-Hsun Chen] SUBMARINE-948. Allow experiments to overcommit memory
---
 .../submitter/k8s/parser/ExperimentSpecParser.java    | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/submarine-server/server-submitter/submitter-k8s/src/main/java/org/apache/submarine/server/submitter/k8s/parser/ExperimentSpecParser.java b/submarine-server/server-submitter/submitter-k8s/src/main/java/org/apache/submarine/server/submitter/k8s/parser/ExperimentSpecParser.java
index 2bc95bb..ece4be7 100644
--- a/submarine-server/server-submitter/submitter-k8s/src/main/java/org/apache/submarine/server/submitter/k8s/parser/ExperimentSpecParser.java
+++ b/submarine-server/server-submitter/submitter-k8s/src/main/java/org/apache/submarine/server/submitter/k8s/parser/ExperimentSpecParser.java
@@ -61,9 +61,7 @@ import java.util.List;
 import java.util.Map;
 
 public class ExperimentSpecParser {
-
-  private static SubmarineConfiguration conf =
-      SubmarineConfiguration.getInstance();
+  private static SubmarineConfiguration conf = SubmarineConfiguration.getInstance();
 
   public static MLJob parseJob(ExperimentSpec experimentSpec) throws InvalidSpecException {
     String framework = experimentSpec.getMeta().getFramework();
@@ -176,7 +174,8 @@ public class ExperimentSpecParser {
     }
     // resources
     V1ResourceRequirements resources = new V1ResourceRequirements();
-    resources.setLimits(parseResources(taskSpec));
+    resources.setRequests(parseResources(taskSpec, true));
+    resources.setLimits(parseResources(taskSpec, false));
     container.setResources(resources);
     container.setEnv(parseEnvVars(taskSpec, experimentSpec.getMeta().getEnvVars()));
 
@@ -343,14 +342,22 @@ public class ExperimentSpecParser {
     return envVars;
   }
 
-  private static Map<String, Quantity> parseResources(ExperimentTaskSpec taskSpec) {
+  private static Map<String, Quantity> parseResources(ExperimentTaskSpec taskSpec, boolean request) {
     Map<String, Quantity> resources = new HashMap<>();
     taskSpec.setResources(taskSpec.getResources());
     if (taskSpec.getCpu() != null) {
       resources.put("cpu", new Quantity(taskSpec.getCpu()));
     }
     if (taskSpec.getMemory() != null) {
-      resources.put("memory", new Quantity(taskSpec.getMemory()));
+      String memoryRequest = taskSpec.getMemory();
+      if (request) {
+        resources.put("memory", new Quantity(memoryRequest)); // ex: 1024M
+      } else {
+        String suffix = memoryRequest.substring(memoryRequest.length() - 1);
+        String value = memoryRequest.substring(0, memoryRequest.length() - 1);
+        String memoryLimit = String.valueOf(Integer.parseInt(value) * 2) + suffix;
+        resources.put("memory", new Quantity(memoryLimit));
+      }
     }
     if (taskSpec.getGpu() != null) {
       resources.put("nvidia.com/gpu", new Quantity(taskSpec.getGpu()));
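
Note on the limit computation in the last hunk: it treats the final character of the memory string as the unit and parses the rest with `Integer.parseInt`. That fits values such as `1024M`, but it would throw `NumberFormatException` for a two-character suffix such as `Gi`, because the value part would be `1G`. Below is a minimal sketch of a more tolerant variant, assuming the quantity is a plain decimal number followed by an optional alphabetic unit; exponent forms such as `1e3` are rejected. The class `MemoryLimitSketch` and the method `doubleMemory` are hypothetical names used for illustration and are not part of Submarine.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Submarine code base.
final class MemoryLimitSketch {
  // A plain decimal number followed by an optional alphabetic unit,
  // e.g. "512M", "1.5Gi", "1024". Exponent forms are not handled.
  private static final Pattern QUANTITY =
      Pattern.compile("^([0-9]+(?:\\.[0-9]+)?)([A-Za-z]*)$");

  private MemoryLimitSketch() {}

  /** Returns the quantity with its numeric part doubled and its unit suffix unchanged. */
  static String doubleMemory(String memoryRequest) {
    Matcher m = QUANTITY.matcher(memoryRequest.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Unsupported memory quantity: " + memoryRequest);
    }
    // BigDecimal avoids int overflow and keeps fractional requests exact.
    BigDecimal doubled = new BigDecimal(m.group(1)).multiply(BigDecimal.valueOf(2));
    return doubled.stripTrailingZeros().toPlainString() + m.group(2);
  }

  public static void main(String[] args) {
    for (String request : new String[] {"512M", "1Gi", "1.5Gi", "1024"}) {
      System.out.println(request + " -> " + doubleMemory(request));
    }
  }
}
```

In `parseResources`, the returned string could be passed to `new Quantity(...)` in place of the hand-built `memoryLimit` value.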
