You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2015/04/27 09:41:14 UTC

[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/5714

    [SPARK-7158] [SQL] Fix bug of cached data cannot be used in collect() after cache()

    When df.cache() method called, the `withCachedData` of `QueryExecution` has been created, which mean it will not look up the cached tables when action method called afterward.
    
    Replace the lazy variable with method will fix this bug.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark SPARK-7158

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5714.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5714
    
----
commit e2c429829e0525b72eaf0d2879d735ab75072c43
Author: Cheng Hao <ha...@intel.com>
Date:   2015-04-27T07:16:50Z

    fix bug of cached data isn't used afterward in DF

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107558832
  
      [Test build #33895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33895/consoleFull) for   PR 5714 at commit [`58ea8aa`](https://github.com/apache/spark/commit/58ea8aab465bf63d6fdae1de62857c12fc128b4c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96552127
  
      [Test build #30963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30963/consoleFull) for   PR 5714 at commit [`e2c4298`](https://github.com/apache/spark/commit/e2c429829e0525b72eaf0d2879d735ab75072c43).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107522876
  
      [Test build #33892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33892/consoleFull) for   PR 5714 at commit [`0b296ea`](https://github.com/apache/spark/commit/0b296ea17b180a8a0be3fffb18a40c989dbb8914).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96875606
  
      [Test build #31099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31099/consoleFull) for   PR 5714 at commit [`c0dc28d`](https://github.com/apache/spark/commit/c0dc28de2c43be79b2e026082a1a82ee532fb2eb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97290748
  
      [Test build #31209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31209/consoleFull) for   PR 5714 at commit [`3b27c4f`](https://github.com/apache/spark/commit/3b27c4f4818d05a007e2de9ee21fa0df2951c2ba).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96868267
  
      [Test build #31087 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31087/consoleFull) for   PR 5714 at commit [`b876ce3`](https://github.com/apache/spark/commit/b876ce3196fa6ee96d56498a2ddcab8e9711cec9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96857479
  
      [Test build #31087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31087/consoleFull) for   PR 5714 at commit [`b876ce3`](https://github.com/apache/spark/commit/b876ce3196fa6ee96d56498a2ddcab8e9711cec9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29313841
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    Sorry, @liancheng for the confusing. We have 2 approaches for this fixing, however, the approach this PR takes will impact the existed code as the examples that I gave above. 
    
    ```
    sql("CREATE TABLE mytest AS SELECT * FROM src").collect()
    ```
    The CTAS will run twice, and will throws exception like TableAlreadyExisted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97328050
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29306154
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    @liancheng why do we need to make rdd returning the same instance?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107557687
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97283351
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97315014
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97388329
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97388333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31270/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97339058
  
    Another approach, to make the `DataFrame.queryExecution` as mutable, and replace it when `cache`/`persist`/`unpersist` invoked. Does that sound more reasonable to a DataFrame?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29216999
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -71,6 +71,23 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
         )
       }
     
    +  test("SPARK-7158 collect and take return different results") {
    +    import java.util.UUID
    +    import org.apache.spark.sql.types._
    +    val rdd = sparkContext.parallelize(List(1, 2, 3), 2)
    --- End diff --
    
    I think you can make this test case much simpler than it is, e.g.
    
    ```scala
    val df = Seq(Tuple1(1)).toDF("col")
    ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96857189
  
    Thank you @harsha2010 , I've updated the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29395147
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -117,7 +117,7 @@ private[sql] object DataFrame {
     @Experimental
     class DataFrame private[sql](
         @transient val sqlContext: SQLContext,
    -    @DeveloperApi @transient val queryExecution: SQLContext#QueryExecution)
    +    @DeveloperApi @transient var queryExecution: SQLContext#QueryExecution)
    --- End diff --
    
    But it's labled `@DeveloperApi`, does that mean a public attribute?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107271454
  
    Ok, I got your mean, creating its own query execution for `Cachemanager` is a good idea,
    but probably still has bug; Consider cases like (It's probably quite rare case, but possible from user code):
    ```scala
    val rdd = mydf.javaToPython // queryExecution.executedPlan will be called.
    mydf.cache()
    mydf.collect() // will not be benefit from the cache
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97328031
  
      [Test build #31248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31248/consoleFull) for   PR 5714 at commit [`2005c94`](https://github.com/apache/spark/commit/2005c94aacfa231bca8df1f44ee7a9dd1aa9cf88).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29311951
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    @rxin @liancheng I think the master code is quite reasonable to me, particularly for the `RunnableCommand`, we don't want to run them twice.
    e.g.
    ```
    sql("CREATE TABLE mytest AS SELECT * FROM src")
    sql("CREATE TABLE mytest AS SELECT * FROM src").collect()
    ```
    
    I've updated lots of code in the unit test, but now I am quite hesitate for the approach using `def` instead of the `lazy val`.
    
    Any ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97247359
  
    So I talked with @marmbrus offline about this, and we actually think it might be ok to just change all the lazy vals to defs (but keep the analyzed one as lazy val), as you originally did, and not worry about performance at the moment.
    
    The problem with the current approach is that cache no longer mutates the underlying dataframe, and users relying on the old dataframe reference will see the same problem.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107557977
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97330586
  
    Here is a simpler reproduction of this issue:
    
    ```scala
    import sqlContext.implicits._
    val df = sc.parallelize(Seq.empty[(Int, Int)]).toDF("key", "value").cache()
    println(df.queryExecution)
    ```
    
    output:
    
    ```
    == Parsed Logical Plan ==
    Project [_1#24 AS key#26,_2#25 AS value#27]
     LogicalRDD [_1#24,_2#25], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:35
    
    == Analyzed Logical Plan ==
    Project [_1#24 AS key#26,_2#25 AS value#27]
     LogicalRDD [_1#24,_2#25], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:35
    
    == Optimized Logical Plan ==
    Project [_1#24 AS key#26,_2#25 AS value#27]
     LogicalRDD [_1#24,_2#25], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:35
    
    == Physical Plan ==
    Project [_1#24 AS key#26,_2#25 AS value#27]
     PhysicalRDD [_1#24,_2#25], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:35
    
    Code Generation: false
    == RDD ==
    ```
    
    We can see that although `df` is cached, the query plan doesn't contain `InMemoryRelation`. The reason is that `DataFrame.persist()` calls `CacheManager.cacheQuery()` for some side effect (caching stuff), and then returns `this`. I'm thinking how about asking `DataFrame.persist()` to return a new `DataFrame` which shares the same logical plan but leverages cached data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96904977
  
      [Test build #31099 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31099/consoleFull) for   PR 5714 at commit [`c0dc28d`](https://github.com/apache/spark/commit/c0dc28de2c43be79b2e026082a1a82ee532fb2eb).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by harsha2010 <gi...@git.apache.org>.
Github user harsha2010 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29155350
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -71,6 +71,26 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
         )
       }
     
    +  test("SPARK-7158 collect and take return different results") {
    +    import java.util.UUID
    +    import org.apache.spark.sql.types._
    +    val rdd = sparkContext.parallelize(List(1,2,3), 2)
    +    val schema = StructType(List(StructField("index",IntegerType,true)))
    +    val df = createDataFrame(rdd.map(p => Row(p)), schema)
    +    def id:() => String = () => {UUID.randomUUID().toString()}
    +    def square:Int => Int = (x: Int) => {x * x}
    +    //expect the ID to have materialized at this point
    +    val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache()
    +    val d1 = dfWithId.collect()
    +
    +    // ID should not be changed, as we will take the data from the cache directly
    --- End diff --
    
    this test could be simpler... no need for the additional dfWithIdAndSquare column or the square function...
    if we were not reading from cache, two consecutive invocations of dfWithId.collect will yield different IDs.
    Makes the test simpler.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107271746
  
    Yeah, any user action that causes the plan to be materialized will have the same problem, but I'm not sure if thats a problem.  RDDs are immutable so users in general should not expect anything about their execution to change once they have been created / materialized.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97356983
  
      [Test build #31270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31270/consoleFull) for   PR 5714 at commit [`22a5fe7`](https://github.com/apache/spark/commit/22a5fe78252dc807789d75b122c6e176a02bb6e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29216965
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -991,20 +991,27 @@ class DataFrame private[sql](
        */
       override def persist(): this.type = {
         sqlContext.cacheManager.cacheQuery(this)
    --- End diff --
    
    can this function call the other persist?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97315056
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107524624
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96954733
  
      [Test build #31127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31127/consoleFull) for   PR 5714 at commit [`a9bf8c1`](https://github.com/apache/spark/commit/a9bf8c1da90460472ac824a4ffbe43f3ea67bab9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107520707
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29312825
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    The snippet you just mentioned should be OK. Even in master code, the `CREATE TABLE` command will also be executed twice. But the following doesn't:
    
    ```scala
    val df = sql("CREAT TABLE ...")
    df.collect()
    df.collect()
    ```
    
    The command is executed while constructing the result DataFrame.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29312058
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificMutableRow.scala ---
    @@ -202,6 +202,7 @@ final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableR
             case DoubleType => new MutableDouble
             case BooleanType => new MutableBoolean
             case LongType => new MutableLong
    +        case DateType => new MutableInt // We use INT for DATE internally
    --- End diff --
    
    given the uncertainty about this pr at the moment, can you submit a separate pr to fix the date?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97388283
  
      [Test build #31270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31270/consoleFull) for   PR 5714 at commit [`22a5fe7`](https://github.com/apache/spark/commit/22a5fe78252dc807789d75b122c6e176a02bb6e4).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch **adds the following new dependencies:**
       * `spark-unsafe_2.10-1.4.0-SNAPSHOT.jar`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29306409
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificMutableRow.scala ---
    @@ -202,6 +202,7 @@ final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableR
             case DoubleType => new MutableDouble
             case BooleanType => new MutableBoolean
             case LongType => new MutableLong
    +        case DateType => new MutableInt // We use INT for DATE internally
    --- End diff --
    
    Without this, the `JDBCSuite`.`test DATE types` will fail, because we take the INTERGER for DATE type as the internal representation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by harsha2010 <gi...@git.apache.org>.
Github user harsha2010 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29155545
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -71,6 +71,26 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
         )
       }
     
    +  test("SPARK-7158 collect and take return different results") {
    +    import java.util.UUID
    +    import org.apache.spark.sql.types._
    +    val rdd = sparkContext.parallelize(List(1,2,3), 2)
    +    val schema = StructType(List(StructField("index",IntegerType,true)))
    +    val df = createDataFrame(rdd.map(p => Row(p)), schema)
    +    def id:() => String = () => {UUID.randomUUID().toString()}
    +    def square:Int => Int = (x: Int) => {x * x}
    +    //expect the ID to have materialized at this point
    --- End diff --
    
    // expect ...
    the scala style guideline for comments require a space after //


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97356919
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96981596
  
      [Test build #31127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31127/consoleFull) for   PR 5714 at commit [`a9bf8c1`](https://github.com/apache/spark/commit/a9bf8c1da90460472ac824a4ffbe43f3ea67bab9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97290752
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31209/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97290750
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97283364
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29312103
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificMutableRow.scala ---
    @@ -202,6 +202,7 @@ final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableR
             case DoubleType => new MutableDouble
             case BooleanType => new MutableBoolean
             case LongType => new MutableLong
    +        case DateType => new MutableInt // We use INT for DATE internally
    --- End diff --
    
    Yeah, I will do that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97315277
  
      [Test build #31248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31248/consoleFull) for   PR 5714 at commit [`2005c94`](https://github.com/apache/spark/commit/2005c94aacfa231bca8df1f44ee7a9dd1aa9cf88).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107520589
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97331420
  
    Yes, that's that I did earlier in this PR, I tend to switch to the that approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29328130
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -117,7 +117,7 @@ private[sql] object DataFrame {
     @Experimental
     class DataFrame private[sql](
         @transient val sqlContext: SQLContext,
    -    @DeveloperApi @transient val queryExecution: SQLContext#QueryExecution)
    +    @DeveloperApi @transient var queryExecution: SQLContext#QueryExecution)
    --- End diff --
    
    We need to restrict this var to private, and provide a read only accessor to this var.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97352865
  
    @chenghao-intel I agree that this conforms to current `DataFrame`/`RDD` semantics and fixes the problem. But I doubt whether this semantics worth adding another mutable state.
    
    @rxin What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107277988
  
    Ok, I will update the code to create the queryExecution within `CacheManager`, thanks for the suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97332291
  
    But this will lead to un-consistent usage with `RDD.cache()` which user don't need to do something like `rdd = rdd.cache()`. Is that OK? (personally I prefer the return-new-DataFrame way...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97356908
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29208973
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    Yes, thanks for reminding, I will submit code with the another approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97328051
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31248/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97335761
  
    @cloud-fan Yeah, that's a good point. @rxin also mentioned this offline just now. However, hm... I guess I should argue that `DataFrame` isn't RDD, it just looks like RDD ;-)
    
    I think the problem here is that although in most cases RDD is immutable, it does has mutable state like storage level. And this single mutable state makes RDD identities (references) matter. Personally I think `RDD.persist()` should also return a new RDD instance instead of rely on side effect from the very beginning. But we can't change that part of API now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107524616
  
      [Test build #33892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33892/consoleFull) for   PR 5714 at commit [`0b296ea`](https://github.com/apache/spark/commit/0b296ea17b180a8a0be3fffb18a40c989dbb8914).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107636310
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-107636279
  
      [Test build #33895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33895/consoleFull) for   PR 5714 at commit [`58ea8aa`](https://github.com/apache/spark/commit/58ea8aab465bf63d6fdae1de62857c12fc128b4c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-96553220
  
      [Test build #30963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30963/consoleFull) for   PR 5714 at commit [`e2c4298`](https://github.com/apache/spark/commit/e2c429829e0525b72eaf0d2879d735ab75072c43).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5714#discussion_r29207799
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1121,23 +1121,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
         def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
     
         lazy val analyzed: LogicalPlan = analyzer.execute(logical)
    -    lazy val withCachedData: LogicalPlan = {
    +    def withCachedData: LogicalPlan = {
           assertAnalyzed()
           cacheManager.useCachedData(analyzed)
         }
    -    lazy val optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
    +    def optimizedPlan: LogicalPlan = optimizer.execute(withCachedData)
     
         // TODO: Don't just pick the first one...
    -    lazy val sparkPlan: SparkPlan = {
    +    def sparkPlan: SparkPlan = {
           SparkPlan.currentContext.set(self)
           planner(optimizedPlan).next()
         }
         // executedPlan should not be used to initialize any SparkPlan. It should be
         // only used for execution.
    -    lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
    +    def executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)
     
         /** Internal version of the RDD. Avoids copies and has no schema */
    -    lazy val toRdd: RDD[Row] = executedPlan.execute()
    +    def toRdd: RDD[Row] = executedPlan.execute()
    --- End diff --
    
    Is this conflict with https://github.com/apache/spark/pull/5265?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7158] [SQL] Fix bug of cached data cann...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5714#issuecomment-97283412
  
      [Test build #31209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31209/consoleFull) for   PR 5714 at commit [`3b27c4f`](https://github.com/apache/spark/commit/3b27c4f4818d05a007e2de9ee21fa0df2951c2ba).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org