Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/06/06 14:37:02 UTC

[GitHub] spark pull request #21502: [SPARK-22575][SQL] Add destroy to Dataset

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/21502

    [SPARK-22575][SQL] Add destroy to Dataset

    ## What changes were proposed in this pull request?
    
    In the Dataset API we may acquire resources which we cannot deallocate explicitly. This happens for broadcast joins: the broadcast object is never destroyed, and we rely on garbage collection of the broadcast object to free it. In the general case this is a safe assumption, but when dynamic allocation is enabled the current approach can leak resources.
    
    In particular, when a Spark application is submitted on YARN with dynamic allocation enabled, we may leak disk space. In such a scenario, when a query with a broadcast join is executed, it is likely that we ask for new containers. These containers are used for the execution of the query and then killed, possibly before the broadcast object is GCed. In this case, the files which have been written are never removed (the container is no longer alive to remove them, and YARN removes them only when the application ends).
    
    In order to solve the above-mentioned issue, the PR proposes to add a `destroy` method to the `Dataset` class, which can be used to free all the resources which have been acquired in the plan execution. By eagerly destroying the acquired resources, we free them before the containers are killed, avoiding (or at least considerably reducing) the problem.
    
    ## How was this patch tested?
    
    Added a unit test (UT).
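
The eager-release idea described in the pull request can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use Spark, and `BroadcastBlock`, `ToyDataset`, and `broadcast_join` are hypothetical names invented for the illustration, not the API added by the patch. It shows a plan that records the on-disk resources it acquired and a `destroy` call that removes them immediately instead of waiting for garbage collection.

```python
import os
import tempfile


class BroadcastBlock:
    """Hypothetical stand-in for a broadcast object written to local disk."""

    def __init__(self, payload: bytes) -> None:
        # Write the payload to a temp file, as an executor might spill a block.
        fd, self.path = tempfile.mkstemp(prefix="broadcast_")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        self.destroyed = False

    def destroy(self) -> None:
        # Idempotent: a second call is a no-op.
        if not self.destroyed:
            os.remove(self.path)
            self.destroyed = True


class ToyDataset:
    """Hypothetical stand-in for a Dataset whose plan acquired resources."""

    def __init__(self) -> None:
        self._acquired = []

    def broadcast_join(self, payload: bytes) -> "ToyDataset":
        # Executing the join acquires a resource that outlives the query.
        self._acquired.append(BroadcastBlock(payload))
        return self

    def destroy(self) -> None:
        # Eagerly free everything the plan acquired, without relying on GC.
        for block in self._acquired:
            block.destroy()
        self._acquired.clear()


ds = ToyDataset().broadcast_join(b"small table")
paths = [block.path for block in ds._acquired]
ds.destroy()  # files are removed now, while the "container" is still alive
```

In this model the files exist until `destroy` is called; without that call they would remain until the object happened to be collected, which is the window in which a killed container would leave them behind.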


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22575

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21502
    
----
commit 147bd08db09fe328de12069c9c0d8a849d99adf4
Author: Marco Gaido <ma...@...>
Date:   2018-01-31T16:35:37Z

    [SPARK-22575][SQL] Add destroy to Dataset

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3824/
    Test PASSed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3832/
    Test PASSed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    I like that approach. If this issue happens only with dynamic allocation, how about adding a new option to turn that check on or off?


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    **[Test build #91504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91504/testReport)** for PR 21502 at commit [`147bd08`](https://github.com/apache/spark/commit/147bd08db09fe328de12069c9c0d8a849d99adf4).


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91524/
    Test FAILed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Well, it happens especially with dynamic allocation, but there may be other causes, such as YARN preemption. Any time a container is killed we can face this issue. Anyway, I plan to check the feasibility of this other approach (it may take some time, as I'm not very familiar with that part of the codebase).


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91528/
    Test PASSed.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    **[Test build #91528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91528/testReport)** for PR 21502 at commit [`4d080cf`](https://github.com/apache/spark/commit/4d080cff795457c6a02b255acc691157afc94e81).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    @maropu I think a monitor thread would be useless. Once the container is gone, there is nothing we can do. Another solution which may be worth investigating is to clear the block manager for an executor before killing it. But I am not sure about this, as it introduces overhead during the scale-down of containers.


---

[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21502
  
    **[Test build #91504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91504/testReport)** for PR 21502 at commit [`147bd08`](https://github.com/apache/spark/commit/147bd08db09fe328de12069c9c0d8a849d99adf4).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
