You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2015/04/11 21:31:38 UTC

[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/5475

    [SPARK-6844][SQL] Clean up accumulators used in InMemoryRelation when it is uncached

    JIRA: https://issues.apache.org/jira/browse/SPARK-6844

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 cache_memory_leak

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5475
    
----
commit 1c3b06e48dc6a0e6ab01b4ff84a43c42aa140d0e
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2015-04-11T19:27:16Z

    Clean up accumulators used in InMemoryRelation when it is uncached.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92021462
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30108/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92021459
  
      [Test build #30108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30108/consoleFull) for   PR 5475 at commit [`26c9bb6`](https://github.com/apache/spark/commit/26c9bb61b70391f2408bbb0846a8e486baf3ec4d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93176357
  
      [Test build #30291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30291/consoleFull) for   PR 5475 at commit [`0b41235`](https://github.com/apache/spark/commit/0b41235b4becf1cd7becd399796bc29d41f5f66e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-91911133
  
      [Test build #30085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30085/consoleFull) for   PR 5475 at commit [`1c3b06e`](https://github.com/apache/spark/commit/1c3b06e48dc6a0e6ab01b4ff84a43c42aa140d0e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-91924793
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30085/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #5475: [SPARK-6844][SQL] Clean up accumulators used in In...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5475#discussion_r161254105
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -297,4 +298,21 @@ class CachedTableSuite extends QueryTest {
         sql("Clear CACHE")
         assert(cacheManager.isEmpty)
       }
    +
    +  test("Clear accumulators when uncacheTable to prevent memory leaking") {
    +    val accsSize = Accumulators.originals.size
    +
    +    sql("SELECT key FROM testData LIMIT 10").registerTempTable("t1")
    +    sql("SELECT key FROM testData LIMIT 5").registerTempTable("t2")
    --- End diff --
    
    We also need to drop these two temp views after tests


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93098023
  
    For `InMemoryColumnarTableScan` if they are only every allocated during testing, I think it is okay if they leak.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-91931540
  
    Thanks for working on this and for adding a test!  I have two issues with the way this is being implemented:
     - We are still leaking accumulators on a per table scan basis, and so you are still going to have a problem if you never explicitly uncache the table
     - It seems like we should be able to do this without adding a global hash map.  In particular, when you are uncaching you have a handle to the InMemoryRelation.  Perhaps that class should define an `uncache` method on this class that takes care of both `unpersist`ing the data and unregistering the accumulators.
    
    The ones inside of the table scan are clearly harder to deal with, but perhaps we can just make those lazy vals that are only initialized during testing due to some conf being set?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93176378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30291/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-91924790
  
      [Test build #30085 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30085/consoleFull) for   PR 5475 at commit [`1c3b06e`](https://github.com/apache/spark/commit/1c3b06e48dc6a0e6ab01b4ff84a43c42aa140d0e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93146300
  
    ok. updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92000380
  
    Thanks for commenting.
    
    For the first issue, I have registered the accumulators of table scan `InMemoryColumnarTableScan`.  If users explicitly uncache the table, these accumulators can be released as I tested. If the table is not explicitly uncached, these accumulators will be leaked as you said. Using a configuration to initialize them sounds good. I will try it.
    
    For the second one, I have tried doing this without a global hash map. However, because the cache manager will call `InMemoryRelation.withOutput` to create a new `InMemoryRelation` each time replacing the logical plan with cached version. The accumulators will be created in these `InMemoryRelation`s. But the cache manager only keeps the original `InMemoryRelation` and we uncache based on it. So I add this global hash map to deal with this problem.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #5475: [SPARK-6844][SQL] Clean up accumulators used in In...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5475#discussion_r161253866
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/columnar/PartitionBatchPruningSuite.scala ---
    @@ -39,6 +39,8 @@ class PartitionBatchPruningSuite extends FunSuite with BeforeAndAfterAll with Be
     
         // Enable in-memory partition pruning
         setConf(SQLConf.IN_MEMORY_PARTITION_PRUNING, "true")
    +    // Enable in-memory table scan accumulators
    +    setConf("spark.sql.inMemoryTableScanStatistics.enable", "true")
    --- End diff --
    
    Why not setting `spark.sql.inMemoryTableScanStatistics.enable` back `false`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92513734
  
    I see, thanks for clarifying.  It seems bad that we are creating accumulators per copy anyway then.  Can we just pass in the accumulator with the buffers and statistics instead so that there is only ever one copy?
    
    Honestly, the best solution would probably be to make the `InMemoryRelation` a `BaseRelation` so that it does not have to handle `newInstance` calls at all.  Unfortunately this API did not exist when that class was created.  However, I'm not sure its worth refactoring at this point.  This code was ported from Shark and I'd like to get rid of it completely in the next release or two.
    
    For the second part, it would be great if we could just get rid of the accumulators completely. It feels bad to have all this extra logic only for a test.  Can you think of any other way that we could test this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93147903
  
      [Test build #30290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30290/consoleFull) for   PR 5475 at commit [`dc1d5d5`](https://github.com/apache/spark/commit/dc1d5d52493cd7317bf54580315d2a96e3814aeb).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92664701
  
    i just found that we can only allocate the accumulator for first one `InMemoryRelation` instead of for all following `InMemoryRelation`. Because the following relations only need to use the pre-computed statistics of the first relation.
    
    The accumulators in `InMemoryColumnarTableScan` are harder to completely removed if we want to collect these statistics, because these statistics are computed during the RDD operations. Accumulator seems the only method we can pass the information out. Currently I still apply a configuration to control if we want to allocate accumulators. But now I put these accumulators in an array in `InMemoryRelation`, instead of a global map before. Of course this array needs to be propagated to following relations because `InMemoryColumnarTableScan` is depending on later relations not the original one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93147563
  
      [Test build #30290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30290/consoleFull) for   PR 5475 at commit [`dc1d5d5`](https://github.com/apache/spark/commit/dc1d5d52493cd7317bf54580315d2a96e3814aeb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93147915
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30290/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93553658
  
    Thanks!  Merged to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5475


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-93150230
  
      [Test build #30291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30291/consoleFull) for   PR 5475 at commit [`0b41235`](https://github.com/apache/spark/commit/0b41235b4becf1cd7becd399796bc29d41f5f66e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6844][SQL] Clean up accumulators used i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5475#issuecomment-92004134
  
      [Test build #30108 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30108/consoleFull) for   PR 5475 at commit [`26c9bb6`](https://github.com/apache/spark/commit/26c9bb61b70391f2408bbb0846a8e486baf3ec4d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org