Posted to reviews@spark.apache.org by mridulm <gi...@git.apache.org> on 2018/10/02 08:06:51 UTC

[GitHub] spark pull request #22609: [SPARK-25594] [Core] Avoid maintaining task infor...

GitHub user mridulm opened a pull request:

    https://github.com/apache/spark/pull/22609

    [SPARK-25594] [Core] Avoid maintaining task information when UI is disabled

    ## What changes were proposed in this pull request?
    
    Avoid maintaining task information in live spark application when UI is disabled.
    For long running application, with large number of tasks, this ended up causing OOM in our tests.
    
    ## How was this patch tested?
    
    Long running test successfully ran for 34 hours with steady memory, when it used to fail in 28 hours with OOM earlier.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mridulm/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22609.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22609
    
----
commit c2f6b4a50ef3693292f600a2d4d7743ea870b96e
Author: Mridul Muralidharan <mr...@...>
Date:   2018-10-02T08:03:41Z

    SPARK-25594: Avoid maintaining task information when UI is disabled

----


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Hi @mridulm ,
    
    It seems that these changes will affect the metrics in `SparkStatusTracker`, e.g. the number of active tasks in `getExecutorInfos` or `getStageInfo`.
    
    Maybe we should still maintain the number of active/total/completed/failed tasks as well.
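    
    A minimal illustration of the `SparkStatusTracker` calls in question (a sketch, not part of the PR; the app name and master are placeholders). These counters are fed by task events, so dropping task bookkeeping entirely would leave them at zero:
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    val sc = new SparkContext(
      new SparkConf().setAppName("status-tracker-demo").setMaster("local[2]"))
    val tracker = sc.statusTracker
    
    // Per-executor view: numRunningTasks is derived from task start/end events.
    tracker.getExecutorInfos.foreach { exec =>
      println(s"${exec.host()}:${exec.port()} running tasks = ${exec.numRunningTasks()}")
    }
    
    // Per-stage view: active/completed/failed task counts for each active stage.
    tracker.getActiveStageIds().foreach { id =>
      tracker.getStageInfo(id).foreach { stage =>
        println(s"stage ${stage.stageId()} active=${stage.numActiveTasks()} " +
          s"completed=${stage.numCompletedTasks()} failed=${stage.numFailedTasks()}")
      }
    }
    ```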


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    +CC @vanzin, @tgravescs 


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    This is a regression introduced in 2.3; unfortunately we did not notice it
    until now.
    
    
    On Tue, Oct 2, 2018 at 4:56 AM Shahid <no...@github.com> wrote:
    
    > Hi @mridulm <https://github.com/mridulm> , can't we limit the task
    > information by setting 'spark.ui.retainedTasks' to a lower value, to avoid the OOM?
    > Correct me if I am wrong.



---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    I will close the PR and document in our tests that they should set spark.ui.retainedTasks = 1 (or some other very low value). In effect that is equivalent to this PR - it loses the task information anyway - but the choice stays specific to the application that opts into it (and which does not care about the API).
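    
    A minimal sketch of that workaround, assuming a plain SparkConf-based setup (the value 1 mirrors the comment; any very low value behaves the same way):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Bound the number of task entries retained per stage in the live store,
    // trading per-task detail in the status API / UI for a bounded memory footprint.
    val conf = new SparkConf()
      .setAppName("long-running-test")          // placeholder name
      .set("spark.ui.retainedTasks", "1")
    ```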


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Task events still might need to be processed - even for live applications with the UI disabled. See the API that has been mentioned before (`SparkStatusTracker`) for why. Processing task events is different from keeping information about individual tasks, though.
    
    I agree it's a regression - but, at the same time, if you set the "retainedTasks" limit to a low value it should at least help a bit.
    
    > Any implications of making it disk backed ?
    
    I tried that a lot while writing this code, but the disk store is just too slow when you have bursts of many events. Even with a lot of improvements I made to this code, I still got dropped events with the disk store, or high memory usage if I tried to do fancy things like using a separate writer thread to unblock the listener bus...
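    
    To illustrate the distinction drawn above - processing task events without keeping per-task state - a hypothetical standalone listener (a sketch, not Spark's internal AppStatusListener) could keep only counters:
    
    ```scala
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}
    
    // Consumes every task event but retains only three counters, so memory use
    // stays constant no matter how many tasks the application runs.
    class TaskCountingListener extends SparkListener {
      private val active = new AtomicLong(0)
      private val completed = new AtomicLong(0)
      private val failed = new AtomicLong(0)
    
      override def onTaskStart(event: SparkListenerTaskStart): Unit =
        active.incrementAndGet()
    
      override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
        active.decrementAndGet()
        if (event.taskInfo.successful) completed.incrementAndGet()
        else failed.incrementAndGet()
      }
    
      def counts: (Long, Long, Long) = (active.get(), completed.get(), failed.get())
    }
    ```
    
    Such a listener could be registered with `sc.addSparkListener(new TaskCountingListener)` or via `spark.extraListeners`.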


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3619/
    Test PASSed.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    **[Test build #96846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96846/testReport)** for PR 22609 at commit [`c2f6b4a`](https://github.com/apache/spark/commit/c2f6b4a50ef3693292f600a2d4d7743ea870b96e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    ok to test


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96846/
    Test FAILed.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    @vanzin , @gengliangwang To clarify the earlier behavior - if the UI was disabled, task related metrics were reported as zero: the granularity of updates was at the stage level.
    The UI is typically disabled when an application does not want the additional overhead associated with it (users leverage the history server to see the application state), particularly in multi-tenant settings (like the thrift server) or very large applications (processing PBs of data).
    If the UI is enabled (the default), or if the application is not `live` (i.e. in the history server), task events would be processed.
    
    Unfortunately, an application + config that used to work before 2.3 now throws OOM (typically the long running tests run for 7+ days to find any kerberos related issues - here an LR stress test failed within 28 hours), and a large part of the heap was occupied by TaskDataWrapper, making this a regression.
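    
    For reference, the deployment pattern described above might look like this (a sketch; the event log path is a placeholder): disable the live UI for a very large or multi-tenant application and rely on event logs plus the history server instead.
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Live UI off: the driver does not serve the web UI for this application.
    // Event logging on: the history server can still reconstruct the full UI later.
    val conf = new SparkConf()
      .set("spark.ui.enabled", "false")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-logs")   // placeholder path
    ```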


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    @gengliangwang is right; there are still things that need to be done when the UI is disabled: job, stage and executor info are all still exposed even if the UI is disabled, and the task event callbacks update those.
    
    It might be as easy as not adding the task to `liveTasks` in `onTaskStart`, but I haven't really looked in detail.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    @vanzin any reason why liveStore is hardcoded to be in-memory? Any implications of making it disk backed? That might be another option to unblock this - it requires a config change for existing applications, but is way better than an OOM.


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    **[Test build #96846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96846/testReport)** for PR 22609 at commit [`c2f6b4a`](https://github.com/apache/spark/commit/c2f6b4a50ef3693292f600a2d4d7743ea870b96e).


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    Hi @mridulm , can't we limit the task information by setting 'spark.ui.retainedTasks' to a lower value, to avoid the OOM? Correct me if I am wrong.


---



[GitHub] spark pull request #22609: [SPARK-25594] [Core] Avoid maintaining task infor...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm closed the pull request at:

    https://github.com/apache/spark/pull/22609


---



[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22609
  
    
    
    > I tried that a lot while writing this code, but the disk store is just too slow
    
    That is unfortunate ... I was actually hoping that would be a good compromise if the expanded API needs to be preserved.
    
    > if the UI was disabled, task related metrics were reported as zero:
    
    I was incorrect in this regard - they were not zero; JobProgressListener had stage/job counters which maintained these values.
    That was limited to the job and stage level though - keeping the actual overhead fairly minimal and honoring spark.ui.retainedStages/spark.ui.retainedJobs.
    
    
    Given this, the current PR would be a functionality regression, since we would end up losing information that we otherwise expose.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org