You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2018/01/24 15:30:21 UTC
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/20384
[SPARK-23195] [SQL] Keep the Hint of Cached Data
## What changes were proposed in this pull request?
The broadcast hint of the cached plan is lost if we cache the plan. This PR is to correct it.
```Scala
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
broadcast(df2).cache()
df2.collect()
val df3 = df1.join(df2, Seq("key"), "inner")
```
## How was this patch tested?
Added a test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark test33
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20384.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20384
----
commit 4ef18b76b13c35dbec0d9e8ecfa753c9b4a80d2c
Author: gatorsmile <ga...@...>
Date: 2018-01-24T15:27:52Z
fix
commit 3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8
Author: gatorsmile <ga...@...>
Date: 2018-01-24T15:28:06Z
fix
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20384
**[Test build #86590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86590/testReport)** for PR 20384 at commit [`3bbec21`](https://github.com/apache/spark/commit/3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/194/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20384#discussion_r163755332
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
}
test("broadcast hint is retained after using the cached data") {
- withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
- val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
- val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
- df2.cache()
- val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
- val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
- case b: BroadcastHashJoinExec => b
- }.size
- assert(numBroadCastHashJoin === 1)
+ try {
+ withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+ val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+ val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+ df2.cache()
+ val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+ val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+ case b: BroadcastHashJoinExec => b
+ }.size
+ assert(numBroadCastHashJoin === 1)
+ }
+ } finally {
+ spark.catalog.clearCache()
--- End diff --
Yeah. That should be a separate bug.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20384
**[Test build #86590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86590/testReport)** for PR 20384 at commit [`3bbec21`](https://github.com/apache/spark/commit/3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20384#discussion_r163586713
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
}
test("broadcast hint is retained after using the cached data") {
- withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
- val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
- val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
- df2.cache()
- val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
- val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
- case b: BroadcastHashJoinExec => b
- }.size
- assert(numBroadCastHashJoin === 1)
+ try {
+ withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+ val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+ val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+ df2.cache()
+ val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+ val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+ case b: BroadcastHashJoinExec => b
+ }.size
+ assert(numBroadCastHashJoin === 1)
+ }
+ } finally {
+ spark.catalog.clearCache()
--- End diff --
If we have to clear the cache, can we add `clearCache` into `afterEach` in general instead of adding this case-by-case?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile closed the pull request at:
https://github.com/apache/spark/pull/20384
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20384#discussion_r163581427
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
}
test("broadcast hint is retained after using the cached data") {
- withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
- val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
- val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
- df2.cache()
- val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
- val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
- case b: BroadcastHashJoinExec => b
- }.size
- assert(numBroadCastHashJoin === 1)
+ try {
+ withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+ val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+ val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+ df2.cache()
+ val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+ val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+ case b: BroadcastHashJoinExec => b
+ }.size
+ assert(numBroadCastHashJoin === 1)
+ }
+ } finally {
+ spark.catalog.clearCache()
--- End diff --
We have to clear the cache. We have another bug to fix in the cache. Will submit another PR to fix the cache plan matching.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86594/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20384
cc @cloud-fan @sameeragarwal @jiangxb1987
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/195/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86590/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20384
**[Test build #86594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86594/testReport)** for PR 20384 at commit [`aa33da3`](https://github.com/apache/spark/commit/aa33da366d93a37fa89ed409307eab8bf70ee6a7).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20384#discussion_r163602179
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
}
test("broadcast hint is retained after using the cached data") {
- withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
- val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
- val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
- df2.cache()
- val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
- val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
- case b: BroadcastHashJoinExec => b
- }.size
- assert(numBroadCastHashJoin === 1)
+ try {
+ withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+ val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+ val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+ df2.cache()
+ val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+ val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+ case b: BroadcastHashJoinExec => b
+ }.size
+ assert(numBroadCastHashJoin === 1)
+ }
+ } finally {
+ spark.catalog.clearCache()
--- End diff --
This test suite is not for testing cache. Thus, it is fine to do it for these two test cases.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20384
**[Test build #86594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86594/testReport)** for PR 20384 at commit [`aa33da3`](https://github.com/apache/spark/commit/aa33da366d93a37fa89ed409307eab8bf70ee6a7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20384#discussion_r163750722
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
}
test("broadcast hint is retained after using the cached data") {
- withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
- val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
- val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
- df2.cache()
- val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
- val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
- case b: BroadcastHashJoinExec => b
- }.size
- assert(numBroadCastHashJoin === 1)
+ try {
+ withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+ val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+ val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+ df2.cache()
+ val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+ val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+ case b: BroadcastHashJoinExec => b
+ }.size
+ assert(numBroadCastHashJoin === 1)
+ }
+ } finally {
+ spark.catalog.clearCache()
--- End diff --
do you mean we can remove the cache cleaning here after you fix that bug?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20384
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org