You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ssaavedra <gi...@git.apache.org> on 2018/09/11 08:31:46 UTC

[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

GitHub user ssaavedra opened a pull request:

    https://github.com/apache/spark/pull/22392

    [SPARK-23200] Reset Kubernetes-specific config on Checkpoint restore

    Several configuration parameters related to Kubernetes need to be
    reset, as they are changed with each invokation of spark-submit and
    thus prevents recovery of Spark Streaming tasks.
    
    ## What changes were proposed in this pull request?
    
    When using the Kubernetes cluster-manager and spawning a Streaming workload, it is important to reset many spark.kubernetes.* properties that are generated by spark-submit but which would get rewritten when restoring a Checkpoint. This is so, because the spark-submit codepath creates Kubernetes resources, such as a ConfigMap, a Secret and other variables, which have an autogenerated name and the previous one will not resolve anymore.
    
    In short, this change enables checkpoint restoration for streaming workloads, and thus enables Spark Streaming workloads in Kubernetes, which were not possible to restore from a checkpoint before if the workload went down.
    
    ## How was this patch tested?
    
    This patch needs would benefit from testing in different k8s clusters.
    
    This is similar to the YARN related code for resetting a Spark Streaming workload, but for the Kubernetes scheduler. This PR removes the initcontainers properties that existed before because they are now removed in master.
    
    For a previous discussion, see the non-rebased work at: apache-spark-on-k8s#516

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ssaavedra/spark fix-checkpointing-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22392
    
----
commit f457d023e8e488b89b97f5b3b9936831d7fc9bb6
Author: Santiago Saavedra <sa...@...>
Date:   2018-09-11T07:58:40Z

    Reset Kubernetes-specific config on Checkpoint restore
    
    Several configuration parameters related to Kubernetes need to be
    reset, as they are changed with each invokation of spark-submit and
    thus prevents recovery of Spark Streaming tasks.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by liyinan926 <gi...@git.apache.org>.
Github user liyinan926 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218616946
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    Yep, I can merge this later today if I don't see more comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3099/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Kubernetes integration test status success
    URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/3099/



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Kubernetes integration test starting
    URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/3099/



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by liyinan926 <gi...@git.apache.org>.
Github user liyinan926 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r217792152
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    Why including the `*.limit.cores` ones?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    @ssaavedra just check this with streaming it works fine. I used the following code jus tin case you want to repeat the test:
    https://gist.github.com/skonto/12217e93e7e9272365eb978025d42a4c
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96055/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22392


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by ssaavedra <gi...@git.apache.org>.
Github user ssaavedra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218616371
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    Should we mark this as resolved and continue on the spark-dev about what to think about adding the rest of the kubernetes.* variables?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218002553
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    Are there any other properties that need to be restored beyond that?
    How about `spark.kubernetes.allocation.batch.size` ,
    `spark.kubernetes.executor.lostCheck.maxAttempts`
    `spark.kubernetes.report.interval`
    `spark.kubernetes.executor.eventProcessingInterval`?
    Also there are several other properties for Python and R, listed in the [config](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    **[Test build #96055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96055/testReport)** for PR 22392 at commit [`f457d02`](https://github.com/apache/spark/commit/f457d023e8e488b89b97f5b3b9936831d7fc9bb6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by ssaavedra <gi...@git.apache.org>.
Github user ssaavedra commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    @skonto this seems to be related to how the Checkpointing code works in general. Variables that are not explicitly whitelisted as per the list I'm changing in this proposal seems to be reloaded from the checklpointed data. I'm not completely sure that it should be so hard to override the variables on job restart, and I vaguely remember that there was an issue at JIRA for changing more broadly how the Checkpointing code worked, but I can't find it right now.
    
    @liyinan926 that was fast :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by ssaavedra <gi...@git.apache.org>.
Github user ssaavedra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218231743
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    I'll try to get this updated by tomorrow, sorry for the delay today


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by liyinan926 <gi...@git.apache.org>.
Github user liyinan926 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r217838764
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    If that's the reasoning, you probably also want to restore all the resource request/limit settings.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by liyinan926 <gi...@git.apache.org>.
Github user liyinan926 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218178934
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    I'm fine deferring consideration for the other properties to a later time. But given that the the core limit ones are included, it makes no sense to exclude the core request ones. If the purpose is to make it work first for 2.4, I would suggest just keeping the ones that are not deterministic, e.g., driver pod name and executor pod name prefix. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    @liyinan926 @ssaavedra  one thing I just noticed is  that if you restart the driver and initially you didnt set spark.executorEnv. then if you set it when you restart the job, it has no effect.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by ssaavedra <gi...@git.apache.org>.
Github user ssaavedra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r217821786
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    Because, in my opinion, the previous checkpointed execution should not impact a newer choice of limit cores, nor should impose its old defaults on the newly-spawned executor pods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by ssaavedra <gi...@git.apache.org>.
Github user ssaavedra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218048682
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    I'm not sure about the use-case there. I think I agree in general with adding all of the request/limit knobs, because the cluster may have changed (e.g., a "core" is now a latest-gen Xeon instead of a prev-gen Celeron), or the job might have starved due to resource limits. However, I'm not sure that the general job settings should need tweaking when reloading a job from a checkpoint.
    
    My rationale is that this should only happen when a job would have been, otherwise, just been running, with no opportunity for you to tweak such settings. I'd argue that if you need that fine-tuning, you should want to perform a new deployment of your job, instead of re-launching the previous one. You need to fix variables that refer to those non-deterministic choices that the deployment process does (such as choosing Pod names, IP addresses and so on), but I'd say the rest of the config flags should be unaffected.
    
    In particular, about the python and R-related configurations, none of those should need to be changed in this case, since that would mean you are actually changing the operations you are going to perform, or the Python version.
    
    I can add the missing resource request/limit flags or remove them completely, I'm not sure what's the better approach. In any case, I think a further discussion in spark-dev should settle this mist, but if you want checkpointing to be ready for 2.4.0 (we are already late), I'd go with what's needed for it to work and follow-up later with more advanced use cases, but I'm open to alternatives. Also, for those flags that are not really cluster-configuration-related, being able to change those flags falls out of scope here, and instead that should lead to a discussion about whether restoring a job from a checkpoint should allow such job to carry different run-time semantics.
    
    Looking forward to a solution :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by foxish <gi...@git.apache.org>.
Github user foxish commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    Jenkins, test this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22392: [SPARK-23200] Reset Kubernetes-specific config on Checkp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22392
  
    **[Test build #96055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96055/testReport)** for PR 22392 at commit [`f457d02`](https://github.com/apache/spark/commit/f457d023e8e488b89b97f5b3b9936831d7fc9bb6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218056207
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    I agree with getting it to work first. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org