Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2016/07/04 03:01:24 UTC

[GitHub] spark pull request #14039: [SPARK-15896][SQL] Clean up shuffle files just af...

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/14039

    [SPARK-15896][SQL] Clean up shuffle files just after jobs finished

    ## What changes were proposed in this pull request?
    Since a `ShuffleRDD` in a SQL query cannot be reused later, this PR removes the shuffle files once a query finishes, to free disk space as soon as possible.
    
    ## How was this patch tested?
    Manually checked that all files were deleted just after the jobs finished.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-15896

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14039.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14039
    
----
commit 4e56d5bb596954349093de3702420e51194ffa42
Author: Takeshi YAMAMURO <li...@gmail.com>
Date:   2016-06-28T22:35:17Z

    Clean up shuffle files just after jobs finished

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61702/
    Test FAILed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    @srowen My understanding is that shuffle data produced by stages can be shared within a job. However, once the job has finished, the current implementation cannot reuse that shuffle data, so we can safely remove it. Is this incorrect? Can Spark reuse shuffle data across different jobs?


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    I don't think we do this in general. The shuffle files are supposed to remain to potentially be reused if the stage needs to be re-executed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    @srowen thanks for the comment. Yea, I noticed that and I'm fixing this to remove only shuffle files generated by `ShuffleExchange`.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61717/consoleFull)** for PR 14039 at commit [`daa859a`](https://github.com/apache/spark/commit/daa859aaa47d1fba502c8621751d7e49fe55c9fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61702/consoleFull)** for PR 14039 at commit [`4e56d5b`](https://github.com/apache/spark/commit/4e56d5bb596954349093de3702420e51194ffa42).


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61702/consoleFull)** for PR 14039 at commit [`4e56d5b`](https://github.com/apache/spark/commit/4e56d5bb596954349093de3702420e51194ffa42).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61717/
    Test PASSed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61717/consoleFull)** for PR 14039 at commit [`daa859a`](https://github.com/apache/spark/commit/daa859aaa47d1fba502c8621751d7e49fe55c9fe).


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61738/consoleFull)** for PR 14039 at commit [`55c8e03`](https://github.com/apache/spark/commit/55c8e034f9a4e231d49c79a77631da58e6130afd).


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Actually, they can be reused -- not in Spark as distributed, but it is an open question whether reusing shuffle files within Spark SQL is something that we should be doing and want to support.  It can be an effective alternative means of caching.  https://issues.apache.org/jira/browse/SPARK-13756
    
    Until that issue is definitively decided, we should not pre-empt the possibility with this PR.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    I haven't got anything more concrete to offer at this time than the descriptions in the relevant JIRAs, but I do have this running in production with 1.6, and it does work.  Essentially, you build a cache in your application whose keys are a canonicalization of query fragments and whose values are RDDs associated with that fragment of the logical plan, and which produce the shuffle files.  For as long as you hold the references to those RDDs in your cache, Spark won't remove the shuffle files.  For as long as you have sufficient memory available to the OS, those shuffle files will be accessed via the OS buffer cache, which is actually pretty quick and doesn't require any Java heap management or garbage collection.  That was the original motivation behind using shuffle files in this way, before off-heap caching and unified memory management were available.  It's less necessary now (at least once I figure out how to do the mapping between logical plan fragments and tables cached off-heap), but it is still a valid alternative caching mechanism.
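
The caching pattern described in the comment above can be illustrated with a minimal sketch. This is a plain-Python model, not Spark code and not the commenter's implementation: `canonicalize`, `ShuffleBackedResult`, and `FragmentCache` are hypothetical names, and a `weakref.finalize` callback stands in for the reference-driven cleanup of shuffle files. It only shows the idea that shuffle output stays available while the application cache holds a reference, and becomes eligible for cleanup once the entry is evicted.

```python
import weakref


def canonicalize(fragment: str) -> str:
    """Toy canonicalization: collapse whitespace and ignore case.

    Assumption: a real implementation would canonicalize a logical plan
    fragment, not SQL text (lowercasing would also alter string literals).
    """
    return " ".join(fragment.lower().split())


class ShuffleBackedResult:
    """Hypothetical stand-in for an RDD whose shuffle files live on disk."""

    def __init__(self, key: str, live_shuffles: set):
        self.key = key
        live_shuffles.add(key)  # models "shuffle files written"
        # Model reference-driven cleanup: once this object is unreachable,
        # its entry is dropped from the set of live shuffle outputs.
        weakref.finalize(self, live_shuffles.discard, key)


class FragmentCache:
    """Maps canonicalized query fragments to objects that own shuffle data."""

    def __init__(self):
        self._entries = {}

    def get_or_build(self, fragment, build):
        key = canonicalize(fragment)
        if key not in self._entries:
            # Build once; later equivalent fragments reuse the same object.
            self._entries[key] = build(key)
        return self._entries[key]

    def evict(self, fragment):
        # Dropping the strong reference is what makes cleanup possible.
        self._entries.pop(canonicalize(fragment), None)
```

In this model, two differently formatted but equivalent fragments map to the same cached object, and the modelled shuffle output disappears only after `evict` drops the last strong reference. The eviction policy is left to the application, which is the disk-space concern raised elsewhere in this thread.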


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61715/consoleFull)** for PR 14039 at commit [`891a100`](https://github.com/apache/spark/commit/891a1007a9bf8afdc9b1945ff597ccc458123ed7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14039: [SPARK-15896][SQL] Clean up shuffle files just af...

Posted by maropu <gi...@git.apache.org>.
Github user maropu closed the pull request at:

    https://github.com/apache/spark/pull/14039


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61738/consoleFull)** for PR 14039 at commit [`55c8e03`](https://github.com/apache/spark/commit/55c8e034f9a4e231d49c79a77631da58e6130afd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61738/
    Test PASSed.


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    @markhamstra Thanks for the comment. I think the reuse of fragments depends heavily on the user's queries, the Catalyst optimizer, cluster resources, and so on. Reusing `ShuffledRowRDD` shuffle data within a single job is a good idea, but it seems difficult to keep the data around across multiple jobs, because Spark cannot know when the data should be garbage-collected and it could consume a lot of disk space. I think the caching mechanism is a better way to reuse fragments across multiple jobs. Or do you have a concrete idea for reusing the shuffle data?


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    **[Test build #61715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61715/consoleFull)** for PR 14039 at commit [`891a100`](https://github.com/apache/spark/commit/891a1007a9bf8afdc9b1945ff597ccc458123ed7).


[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61715/
    Test PASSed.

