Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/05/31 19:00:52 UTC

[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/13419

    [SPARK-15678][SQL] Drop cache on appends and overwrites

    ## What changes were proposed in this pull request?
    
    SparkSQL currently doesn't drop caches if the underlying data is overwritten. This PR fixes that behavior.
    
    ```scala
    val dir = "/tmp/test"
    sqlContext.range(1000).write.mode("overwrite").parquet(dir)
    val df = sqlContext.read.parquet(dir).cache()
    df.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <---- We are still using the cached dataset
    ```
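The invalidation this PR proposes can be pictured with a toy sketch (hypothetical names, not Spark's actual `CacheManager` API): a cache keyed by path must drop its entry whenever a write touches that path, so the next read recomputes from the updated backing store.

```scala
// Toy model of the proposed behavior -- NOT Spark's CacheManager API.
// A write to a path drops any cache entry for that path, so the next
// read recomputes the count from the (updated) backing store.
object ToyCache {
  private val storage = scala.collection.mutable.Map.empty[String, Long]
  private val cache   = scala.collection.mutable.Map.empty[String, Long]

  def write(path: String, rows: Long): Unit = {
    storage(path) = rows
    cache.remove(path) // the fix: invalidate on overwrite/append
  }

  def readCount(path: String): Long =
    cache.getOrElseUpdate(path, storage(path))
}
```

With this model, a second `write` to the same path means the following `readCount` sees the new row count instead of the stale cached one, which is exactly the behavior the repro above shows is missing.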
    
    ## How was this patch tested?
    
    Unit tests for overwrites and appends in `ParquetQuerySuite`.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark drop-cache-on-write

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13419
    
----
commit ee631d2d98f72d99da00d8922fc4cf6a66cf063c
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-05-31T18:27:41Z

    Drop cache on appends and overwrites

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Hi, @sameeragarwal .
    Is there any reason to use `SQLContext` instead of `SparkSession` in this PR?




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    **[Test build #59706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59706/consoleFull)** for PR 13419 at commit [`a21013a`](https://github.com/apache/spark/commit/a21013ab27a7400c850aebe66dfcd0e7e6c2ea3c).




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    **[Test build #59706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59706/consoleFull)** for PR 13419 at commit [`a21013a`](https://github.com/apache/spark/commit/a21013ab27a7400c850aebe66dfcd0e7e6c2ea3c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13419: [SPARK-15678][SQL] Not use cache on appends and o...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal closed the pull request at:

    https://github.com/apache/spark/pull/13419




[GitHub] spark issue #13419: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/13419
  
    I guess that the caching is done over multiple nodes. If the data for a dataset is updated physically and some of the nodes where the data was cached go down, would the existing `cached` dataset be invalidated and refreshed? If not, then old dataframes can give inconsistent or incomplete data.




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13419#discussion_r65251574
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
    @@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
           TableIdentifier("tmp"), ignoreIfNotExists = true)
       }
     
    +  test("drop cache on overwrite") {
    +    withTempDir { dir =>
    +      val path = dir.toString
    +      spark.range(1000).write.mode("overwrite").parquet(path)
    +      val df = sqlContext.read.parquet(path).cache()
    +      assert(df.count() == 1000)
    +      sqlContext.range(10).write.mode("overwrite").parquet(path)
    +      assert(sqlContext.read.parquet(path).count() == 10)
    +    }
    +  }
    +
    +  test("drop cache on append") {
    +    withTempDir { dir =>
    +      val path = dir.toString
    +      spark.range(1000).write.mode("append").parquet(path)
    +      val df = sqlContext.read.parquet(path).cache()
    +      assert(df.count() == 1000)
    +      sqlContext.range(10).write.mode("append").parquet(path)
    --- End diff --
    
    sqlContext -> spark




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    @dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Also cc'ing @davies




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    **[Test build #59668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    **[Test build #59668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).




[GitHub] spark issue #13419: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13419
  
    I ended up creating a small design doc describing the problem and presenting 2 possible solutions at https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#. Based on this, we decided in favor of option 2 (https://github.com/apache/spark/pull/13566) as it is a less intrusive change to the default behavior. I'm going to close this PR for now, but we may revisit this approach (i.e., option 1) for 2.1.




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    @yhuai @mengxr what are your thoughts on this approach?




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13419#discussion_r65251560
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
    @@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
           TableIdentifier("tmp"), ignoreIfNotExists = true)
       }
     
    +  test("drop cache on overwrite") {
    +    withTempDir { dir =>
    +      val path = dir.toString
    +      spark.range(1000).write.mode("overwrite").parquet(path)
    +      val df = sqlContext.read.parquet(path).cache()
    +      assert(df.count() == 1000)
    +      sqlContext.range(10).write.mode("overwrite").parquet(path)
    --- End diff --
    
    sqlContext -> spark




[GitHub] spark issue #13419: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13419
  
    @tejasapatil if the nodes where the data was cached go down, the CacheManager should still consider that data as cached. In that case, the next time the data is accessed, the underlying RDD will be recomputed and cached again.
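The recompute-on-access behavior described in this reply can be illustrated with a toy cache (hypothetical, not Spark's `CacheManager`): evicting an entry, as when a node goes down, does not mark the data as uncached; the next access simply recomputes it from the source and caches it again.

```scala
// Toy illustration -- not Spark's CacheManager. Evicting an entry (a
// "lost node") does not invalidate the dataset; the next access
// recomputes it from the underlying source and caches it again.
class LossyCache[K, V](compute: K => V) {
  private val cached = scala.collection.mutable.Map.empty[K, V]
  var recomputes = 0 // counts how often the source is re-read

  def get(key: K): V =
    cached.getOrElseUpdate(key, { recomputes += 1; compute(key) })

  def evict(key: K): Unit = cached.remove(key) // simulates a lost node
}
```

The point of the sketch: after `evict`, a `get` returns the same value but bumps `recomputes`, mirroring how a lost cached RDD partition is transparently rebuilt on the next access.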




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59668/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    @mengxr it seems like overwriting generates new files, so we can achieve the same semantics without introducing an additional timestamp. The current solution should respect the contract for old dataframes while making sure that the new ones don't use the cached value. Let me know what you think.




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    I would prefer refreshing the dataset every time it is reloaded, while keeping existing ones unchanged.
    
    ~~~scala
    val df1 = sqlContext.read.parquet(dir).cache()
    df1.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    val df2 = sqlContext.read.parquet(dir)
    df2.count() // outputs 10
    df1.count() // still outputs 1000 because it was cached
    ~~~
    
    Neither approach is perfectly safe, so I have no strong preference either way.




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59706/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15678][SQL] Not use cache on appends and overwrit...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13419
  
    Merged build finished. Test FAILed.

