You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2015/02/19 07:54:33 UTC

[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/4689

    [SPARK-5881][SQL] RDD remains cached after the table gets overridden by "CACHE TABLE"

    See JIRA for the example to reproduce the bug (https://issues.apache.org/jira/browse/SPARK-5881).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark cacheFix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4689.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4689
    
----
commit f1787be424697146e18a9adfb2330b35d4de8415
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-18T23:29:23Z

    Add clear cache command.

commit 6cfd9df5bb88e744ddc232700679e0cb066a94ae
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-19T06:51:13Z

    Python clearCache.

commit d41d68868fc0cc2057e0686344e02f92cb3e054e
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-19T06:51:25Z

    When register a table as a temporary table, if there is an existing table that is cached and has a different query plan, uncache the existing table.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4689#issuecomment-75084357
  
    I will open another PR to add the clearCache command. Closing this one


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r24976568
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSQLParser.scala ---
    @@ -74,9 +75,15 @@ private[sql] class SparkSQLParser(fallback: String => LogicalPlan) extends Abstr
         }
     
       private lazy val uncache: Parser[LogicalPlan] =
    -    UNCACHE ~ TABLE ~> ident ^^ {
    -      case tableName => UncacheTableCommand(tableName)
    -    }
    +    (
    +      UNCACHE ~ TABLE ~> ident ^^ {
    +        case tableName => UncacheTableCommand(tableName)
    +      }
    +    |
    +      CLEAR ~ CACHE ^^ {
    +        case _ ~ _ => ClearCacheCommand
    +      }
    +    )
    --- End diff --
    
    Nit: This can be simplified to
    
    ```scala
    ( UNCACHE ~ TABLE ~> ident ^^ {
        case tableName => UncacheTableCommand(tableName)
      }
    | CLEAR ~ CACHE ^^^ ClearCacheCommand
    )
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r24999389
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/CacheManager.scala ---
    @@ -39,7 +39,7 @@ private case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryR
     private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
     
       @transient
    -  private val cachedData = new scala.collection.mutable.ArrayBuffer[CachedData]
    +  val cachedData = new scala.collection.mutable.ArrayBuffer[CachedData]
    --- End diff --
    
    Oh, right. Thanks for catching it. I was using it for debugging. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4689#issuecomment-75024215
  
    This PR also added a `CLEAR CACHE` statement, would be good to add this in the PR description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r24976046
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -888,6 +893,13 @@ class SQLContext(@transient val sparkContext: SparkContext)
        * only during the lifetime of this instance of SQLContext.
        */
       private[sql] def registerDataFrameAsTable(df: DataFrame, tableName: String): Unit = {
    +    if (catalog.tableExists(Seq(tableName)) && cacheManager.lookupCachedData(df).isEmpty) {
    +      // If the table already exists and the data of df has not already been cached
    +      // (we are trying to overwrite an existing temporary table),
    +      // we will try to uncache the InMemoryRelation associated with the existing table.
    +      cacheManager.tryUncacheQuery(table(tableName))
    --- End diff --
    
    Should we add `blocking = false` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r25000852
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -280,4 +280,43 @@ class CachedTableSuite extends QueryTest {
         assert(intercept[RuntimeException](table("t1")).getMessage.startsWith("Table Not Found"))
         assert(!isCached("t2"))
       }
    +
    +  test("Drop cached temporary table when the table gets overwritten") {
    +    val query1 = "SELECT key FROM testData LIMIT 10"
    +    val df1 = sql(query1)
    +    df1.registerTempTable("t1")
    +    sql(s"CACHE TABLE t2 AS $query1")
    +    assert(isCached("t2"))
    +    // t1 will be cached too because it has the same plan as t2.
    +    assert(isCached("t1"))
    +    assert(cacheManager.lookupCachedData(df1).isDefined)
    +
    +    val query2 = "SELECT key FROM testData LIMIT 5"
    +    val df2 = sql(query2)
    +    sql(s"CACHE TABLE t2 AS $query2")
    +    // t1 will not be cached because it has not been explicitly cached.
    +    assert(!isCached("t1"))
    +    assert(isCached("t2"))
    +    assert(cacheManager.lookupCachedData(df2).isDefined)
    +    dropTempTable("t2")
    +
    +    assert(cacheManager.lookupCachedData(df1).isEmpty)
    +    assert(cacheManager.lookupCachedData(df2).isEmpty)
    +  }
    --- End diff --
    
    Right, `t1` will be uncached. We need to also track the reference of a logical plan to make it correct. Let me just use this PR to add clearCache command.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4689#issuecomment-75015004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27710/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4689#issuecomment-75014993
  
      [Test build #27710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27710/consoleFull) for   PR 4689 at commit [`d41d688`](https://github.com/apache/spark/commit/d41d68868fc0cc2057e0686344e02f92cb3e054e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4689#issuecomment-75006857
  
      [Test build #27710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27710/consoleFull) for   PR 4689 at commit [`d41d688`](https://github.com/apache/spark/commit/d41d68868fc0cc2057e0686344e02f92cb3e054e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r24975506
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/CacheManager.scala ---
    @@ -39,7 +39,7 @@ private case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryR
     private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
     
       @transient
    -  private val cachedData = new scala.collection.mutable.ArrayBuffer[CachedData]
    +  val cachedData = new scala.collection.mutable.ArrayBuffer[CachedData]
    --- End diff --
    
    Seems no need to make this public? Didn't find this referenced anywhere other than `CacheManager`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4689#discussion_r24977332
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -280,4 +280,43 @@ class CachedTableSuite extends QueryTest {
         assert(intercept[RuntimeException](table("t1")).getMessage.startsWith("Table Not Found"))
         assert(!isCached("t2"))
       }
    +
    +  test("Drop cached temporary table when the table gets overwritten") {
    +    val query1 = "SELECT key FROM testData LIMIT 10"
    +    val df1 = sql(query1)
    +    df1.registerTempTable("t1")
    +    sql(s"CACHE TABLE t2 AS $query1")
    +    assert(isCached("t2"))
    +    // t1 will be cached too because it has the same plan as t2.
    +    assert(isCached("t1"))
    +    assert(cacheManager.lookupCachedData(df1).isDefined)
    +
    +    val query2 = "SELECT key FROM testData LIMIT 5"
    +    val df2 = sql(query2)
    +    sql(s"CACHE TABLE t2 AS $query2")
    +    // t1 will not be cached because it has not been explicitly cached.
    +    assert(!isCached("t1"))
    +    assert(isCached("t2"))
    +    assert(cacheManager.lookupCachedData(df2).isDefined)
    +    dropTempTable("t2")
    +
    +    assert(cacheManager.lookupCachedData(df1).isEmpty)
    +    assert(cacheManager.lookupCachedData(df2).isEmpty)
    +  }
    --- End diff --
    
    How about this case:
    
    ```scala
    val df1 = sql("SELECT * FROM testData LIMIT 10")
    df1.registerTempTable("t1")
    
    // Cache t1 explicitly
    sql("CACHE TABLE t1")
    
    // t1 and t2 share the same query plan
    sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 10")
    
    // Replace t2 with a different query plan
    sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 5")
    
    // Should t1 remain cached here?
    ```
    
    To my understanding, with this PR, `t1` is implicitly uncached, which may not be the behavior we want. I think we also need a map from query plans to table names to prevent unexpected uncache operations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5881][SQL] RDD remains cached after the...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai closed the pull request at:

    https://github.com/apache/spark/pull/4689


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org