Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2018/01/24 15:30:21 UTC

[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/20384

    [SPARK-23195] [SQL] Keep the Hint of Cached Data

    ## What changes were proposed in this pull request?
    The broadcast hint attached to a plan is lost once that plan is cached, so joins that reuse the cached data no longer honor the hint. This PR corrects that.
    
    ```Scala
      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
      // Attach a broadcast hint to df2, then cache and materialize it.
      broadcast(df2).cache()
      df2.collect()
      // Without this fix, the broadcast hint is lost when the join below reuses the cached df2.
      val df3 = df1.join(df2, Seq("key"), "inner")
    ```
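
    For reference, the behavior can be checked by counting the broadcast hash join operators in the executed plan, as the added test does. A minimal sketch, adapted from that test (assumes `spark.sql.autoBroadcastJoinThreshold` is set to `-1` so that only the hint can trigger a broadcast):

    ```Scala
      import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
      // With the hint retained on the cached data, exactly one broadcast hash join is expected.
      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
        case b: BroadcastHashJoinExec => b
      }.size
      assert(numBroadCastHashJoin === 1)
    ```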
    
    ## How was this patch tested?
    Added a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark test33

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20384
    
----
commit 4ef18b76b13c35dbec0d9e8ecfa753c9b4a80d2c
Author: gatorsmile <ga...@...>
Date:   2018-01-24T15:27:52Z

    fix

commit 3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8
Author: gatorsmile <ga...@...>
Date:   2018-01-24T15:28:06Z

    fix

----


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    **[Test build #86590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86590/testReport)** for PR 20384 at commit [`3bbec21`](https://github.com/apache/spark/commit/3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8).


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/194/
    Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20384#discussion_r163755332
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
       }
     
       test("broadcast hint is retained after using the cached data") {
    -    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    -      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    -      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    -      df2.cache()
    -      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    -      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    -        case b: BroadcastHashJoinExec => b
    -      }.size
    -      assert(numBroadCastHashJoin === 1)
    +    try {
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        df2.cache()
    +        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    +        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    +          case b: BroadcastHashJoinExec => b
    +        }.size
    +        assert(numBroadCastHashJoin === 1)
    +      }
    +    } finally {
    +      spark.catalog.clearCache()
    --- End diff --
    
    Yeah. That should be a separate bug. 


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    **[Test build #86590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86590/testReport)** for PR 20384 at commit [`3bbec21`](https://github.com/apache/spark/commit/3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20384#discussion_r163586713
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
       }
     
       test("broadcast hint is retained after using the cached data") {
    -    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    -      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    -      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    -      df2.cache()
    -      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    -      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    -        case b: BroadcastHashJoinExec => b
    -      }.size
    -      assert(numBroadCastHashJoin === 1)
    +    try {
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        df2.cache()
    +        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    +        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    +          case b: BroadcastHashJoinExec => b
    +        }.size
    +        assert(numBroadCastHashJoin === 1)
    +      }
    +    } finally {
    +      spark.catalog.clearCache()
    --- End diff --
    
    If we have to clear the cache, can we add `clearCache` into `afterEach` in general instead of adding this case-by-case?
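
    A minimal sketch of what that could look like (hypothetical; assumes the suite mixes in ScalaTest's `BeforeAndAfterEach` and that `spark` refers to the shared test session):

    ```Scala
      // Hypothetical sketch: clear all cached data after every test instead of
      // wrapping individual test bodies in try/finally.
      override def afterEach(): Unit = {
        try {
          spark.catalog.clearCache()
        } finally {
          super.afterEach()
        }
      }
    ```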


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile closed the pull request at:

    https://github.com/apache/spark/pull/20384


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20384#discussion_r163581427
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
       }
     
       test("broadcast hint is retained after using the cached data") {
    -    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    -      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    -      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    -      df2.cache()
    -      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    -      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    -        case b: BroadcastHashJoinExec => b
    -      }.size
    -      assert(numBroadCastHashJoin === 1)
    +    try {
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        df2.cache()
    +        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    +        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    +          case b: BroadcastHashJoinExec => b
    +        }.size
    +        assert(numBroadCastHashJoin === 1)
    +      }
    +    } finally {
    +      spark.catalog.clearCache()
    --- End diff --
    
    We have to clear the cache here. There is another bug in the cache plan matching; I will submit a separate PR to fix it.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86594/
    Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    cc @cloud-fan @sameeragarwal @jiangxb1987 


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/195/
    Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86590/
    Test PASSed.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    **[Test build #86594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86594/testReport)** for PR 20384 at commit [`aa33da3`](https://github.com/apache/spark/commit/aa33da366d93a37fa89ed409307eab8bf70ee6a7).


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20384#discussion_r163602179
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
       }
     
       test("broadcast hint is retained after using the cached data") {
    -    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    -      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    -      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    -      df2.cache()
    -      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    -      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    -        case b: BroadcastHashJoinExec => b
    -      }.size
    -      assert(numBroadCastHashJoin === 1)
    +    try {
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        df2.cache()
    +        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    +        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    +          case b: BroadcastHashJoinExec => b
    +        }.size
    +        assert(numBroadCastHashJoin === 1)
    +      }
    +    } finally {
    +      spark.catalog.clearCache()
    --- End diff --
    
    This test suite is not for testing caching, so it is fine to clear the cache case-by-case for these two test cases.


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    **[Test build #86594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86594/testReport)** for PR 20384 at commit [`aa33da3`](https://github.com/apache/spark/commit/aa33da366d93a37fa89ed409307eab8bf70ee6a7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20384#discussion_r163750722
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
       }
     
       test("broadcast hint is retained after using the cached data") {
    -    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    -      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    -      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    -      df2.cache()
    -      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    -      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    -        case b: BroadcastHashJoinExec => b
    -      }.size
    -      assert(numBroadCastHashJoin === 1)
    +    try {
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        df2.cache()
    +        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
    +        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
    +          case b: BroadcastHashJoinExec => b
    +        }.size
    +        assert(numBroadCastHashJoin === 1)
    +      }
    +    } finally {
    +      spark.catalog.clearCache()
    --- End diff --
    
    Do you mean the cache clearing here can be removed after you fix that bug?


---



[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20384
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org