Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/10 16:07:41 UTC

[GitHub] [spark] wangyum opened a new pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

wangyum opened a new pull request #29065:
URL: https://github.com/apache/spark/pull/29065


   ### What changes were proposed in this pull request?
   
   We can improve the performance of some joins by pre-filtering one side of a join using the values from the other side of the join. To verify that Bloom Filter Join is effective, we first establish the following points:
   
   1. Reducing shuffle data can improve performance.
      - Pre-filtering does not improve performance for broadcast joins.
   
           | Default | pre-filter `ss_store_sk in(41, 543, 694)` |
           |:----------:|----------|
           | create table test.case1 using parquet as  SELECT s1.* FROM store_sales s1 join store on s_store_sk = ss_store_sk AND s_state IN ('TN') AND s_zip < 30534 |  create table test.case1 using parquet as SELECT s1.* FROM store_sales s1 join store on s_store_sk = ss_store_sk AND s_state IN ('TN') AND s_zip < 30534 and ss_store_sk in(41, 543, 694) |
           | <img src="https://user-images.githubusercontent.com/5399861/87167941-4f88c080-c300-11ea-99b7-1e25a2eb5808.png" width="410"> |  <img src="https://user-images.githubusercontent.com/5399861/87167936-4d266680-c300-11ea-9e34-5d097c4cd4f0.png" width="410">|
   
      - Pre-filtering can improve performance for sort merge joins.
           | Default | pre-filter `ss_store_sk in(41, 543, 694)` |
           |:----------:|----------|
           | create table test.case2 using parquet as  SELECT s1.* FROM store_sales s1 join store on s_store_sk = ss_store_sk AND s_state IN ('TN') AND s_zip < 30534 |  create table test.case2 using parquet as SELECT s1.* FROM store_sales s1 join store on s_store_sk = ss_store_sk AND s_state IN ('TN') AND s_zip < 30534 and ss_store_sk in(41, 543, 694) |
           | <img src="https://user-images.githubusercontent.com/5399861/87168933-c8d4e300-c301-11ea-9ec3-dd1019730bf1.png" width="410"> |  <img src="https://user-images.githubusercontent.com/5399861/87168936-cb373d00-c301-11ea-97d1-49a2ca2aff3e.png" width="410">|
   
   
   2. It is difficult to evaluate whether dynamic Min-Max runtime filtering is effective. For example, a Bloom Filter can prune more data than a Min-Max Filter for TPC-DS q95:
       - Calculate the Min-Max values of `ws1.ws_ship_date_sk` based on `d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY) AND ws1.ws_ship_date_sk = d_date_sk`:
      ```sql
      spark-sql> select min(d_date_sk), max(d_date_sk) from date_dim WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY);
      2451211	2451271
      ```
      - Add the new predicate `ws1.ws_ship_date_sk >= 2451211 AND ws1.ws_ship_date_sk <= 2451271` for the Min-Max Filter:
           | Default |  Min-Max Filter | Bloom Filter |
           |:----------:|----------|----------|
           | <img src="https://user-images.githubusercontent.com/5399861/87174683-20774c80-c30a-11ea-96c1-72359aea650e.png" width="247"> |  <img src="https://user-images.githubusercontent.com/5399861/87174529-e148fb80-c309-11ea-8076-49224e856e5e.png" width="247"> | <img src="https://user-images.githubusercontent.com/5399861/87174522-dc844780-c309-11ea-9062-ec2211884003.png" width="247">|
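
A minimal sketch of why a membership-based filter can reject more probe rows than a Min-Max range when the build-side keys are sparse. It assumes hypothetical keys and a toy Bloom filter (`ToyBloomFilter` and `RuntimeFilterSketch` are illustrative names, not part of this PR or of Spark), so it only illustrates the idea and says nothing about the q95 measurements above.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Toy Bloom filter over Long keys (hypothetical; not Spark's BloomFilter).
final class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new mutable.BitSet(numBits)

  // Derive the i-th bit position for a key by hashing it with seed i.
  private def position(key: Long, i: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key.toString, i), numBits)

  def put(key: Long): Unit =
    (0 until numHashes).foreach(i => bits += position(key, i))

  // May return true for a key that was never added, never false for one that was.
  def mightContain(key: Long): Boolean =
    (0 until numHashes).forall(i => bits(position(key, i)))
}

object RuntimeFilterSketch {
  // Hypothetical join keys: a sparse build side and a dense probe side.
  val buildKeys: Seq[Long] = Seq(10L, 500L, 990L)
  val probeKeys: Seq[Long] = 1L to 1000L

  // Min-Max filter: keeps every probe key inside [min, max] of the build side.
  def minMaxFilter(build: Seq[Long], probe: Seq[Long]): Seq[Long] = {
    val (lo, hi) = (build.min, build.max)
    probe.filter(k => k >= lo && k <= hi)
  }

  // Bloom filter: keeps only probe keys that might be present on the build side.
  def bloomFilter(build: Seq[Long], probe: Seq[Long]): Seq[Long] = {
    val bf = new ToyBloomFilter(numBits = 1 << 13, numHashes = 3)
    build.foreach(bf.put)
    probe.filter(bf.mightContain)
  }
}
```

With these assumed keys the Min-Max range [10, 990] keeps 981 of the 1000 probe keys, while the Bloom filter keeps the three build keys plus any false positives.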
   
   
   
   
   3. Evaluate dynamic Bloom Filter runtime filtering with TPC-DS.
   
   Query | Default (seconds) | Bloom Filter Join enabled (seconds)
   -- | -- | --
   tpcds q16 | 84 | 46
   tpcds q36 | 29 | 21
   tpcds q57 | 39 | 28
   tpcds q94 | 42 | 34
   tpcds q95 | 306 | 288
   
   
   
   TODO:
     1. Add codegen support for `BuildBloomFilter` and `InBloomFilter`.
     2. Add a new `DynamicFilter` that supports filter pushdown.
     3. Reuse `BroadcastExchange`.
     4. Replace the Bloom filter with an `In` predicate if the number of values is less than `spark.sql.parquet.pushdown.inFilterThreshold`.
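
A rough sketch of TODO item 4, assuming the decision is made from the number of distinct build-side values. `FilterKind`, `FilterChoiceSketch` and `choose` are hypothetical names; only the config key `spark.sql.parquet.pushdown.inFilterThreshold` comes from the list above, and its value is passed in as a plain parameter.

```scala
// Hypothetical tags for the two kinds of runtime filter discussed in this PR.
sealed trait FilterKind
case object UseInPredicate extends FilterKind
case object UseBloomFilter extends FilterKind

object FilterChoiceSketch {
  // Prefer an exact IN predicate when the build side has fewer distinct
  // values than the threshold; otherwise fall back to a Bloom filter.
  def choose(distinctValueCount: Long, inFilterThreshold: Int): FilterKind =
    if (distinctValueCount < inFilterThreshold) UseInPredicate else UseBloomFilter
}
```

An `In` predicate is exact and can be pushed down to the data source, which is presumably the motivation for preferring it on small value sets.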
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
    
   TODO.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r467917432



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala
##########
@@ -37,24 +37,24 @@ trait DynamicPruning extends Predicate
  *  can reuse the results of the broadcast through ReuseExchange
  * @param broadcastKeyIndex the index of the filtering key collected from the broadcast
  */
-case class DynamicPruningSubquery(
+case class PartitionPruningSubquery(
     pruningKey: Expression,
     buildQuery: LogicalPlan,
     buildKeys: Seq[Expression],
     broadcastKeyIndex: Int,
     onlyInBroadcast: Boolean,
     exprId: ExprId = NamedExpression.newExprId)
   extends SubqueryExpression(buildQuery, Seq(pruningKey), exprId)
-  with DynamicPruning
-  with Unevaluable {
+    with DynamicPruning
+    with Unevaluable {

Review comment:
       The original indentation is correct.
   - https://github.com/databricks/scala-style-guide/blob/master/README.md#indent






[GitHub] [spark] wangyum commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-664808658


   @jovany-wang Thank you very much for your suggestion. I appreciate the time and effort you have spent to share your insightful comments, which will be seriously considered.




[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671522005


   **[Test build #127281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127281/testReport)** for PR 29065 at commit [`94bfb36`](https://github.com/apache/spark/commit/94bfb36c4791772183a82cf4565fdea7ef7fb460).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-664810904








[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656757326








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688647916








[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688647269


   **[Test build #128383 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128383/testReport)** for PR 29065 at commit [`1a8cc9b`](https://github.com/apache/spark/commit/1a8cc9b2718c820df7962c82b7ac1007c7712b1c).




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r467918094



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala
##########
@@ -67,7 +67,58 @@ case class DynamicPruningSubquery(
       pruningKey.dataType == buildKeys(broadcastKeyIndex).dataType
   }
 
-  override def toString: String = s"dynamicpruning#${exprId.id} $conditionString"
+  override def toString: String = s"partitionpruning#${exprId.id} $conditionString"
+
+  override lazy val canonicalized: DynamicPruning = {
+    copy(
+      pruningKey = pruningKey.canonicalized,
+      buildQuery = buildQuery.canonicalized,
+      buildKeys = buildKeys.map(_.canonicalized),
+      exprId = ExprId(0))
+  }
+}
+
+/**
+ * The BloomFilterPruningSubquery expression is only used in join operations to prune one side of
+ * the join with a filter from the other side of the join. It is inserted in cases where shuffle
+ * pruning can be applied.
+ *
+ * @param pruningKey the filtering key of the plan to be pruned.
+ * @param buildQuery the build side of the join.
+ * @param buildKeys the join keys corresponding to the build side of the join
+ * @param broadcastKeyIndex the index of the filtering key collected from the broadcast
+ */
+case class BloomFilterPruningSubquery(
+    pruningKey: Expression,
+    buildQuery: LogicalPlan,
+    buildKeys: Seq[Expression],
+    broadcastKeyIndex: Int,
+    exprId: ExprId = NamedExpression.newExprId)
+  extends SubqueryExpression(buildQuery, Seq(pruningKey), exprId)
+    with DynamicPruning
+    with Unevaluable {

Review comment:
       Please see https://github.com/databricks/scala-style-guide/blob/master/README.md#indent and adjust the indentation.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656873796


   Merged build finished. Test FAILed.




[GitHub] [spark] maropu commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-703606306


   > Reduce the shuffle data can significantly improve the query performance
   
   BTW, IMHO the name `ShufflePruning` looks a bit misleading. At first I thought this PR was aimed at removing shuffle exchanges via runtime filters.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656757326








[GitHub] [spark] maropu commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-703605031


   What's the current status of this PR? Waiting for reviews?




[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656756847


   **[Test build #125627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125627/testReport)** for PR 29065 at commit [`a47485b`](https://github.com/apache/spark/commit/a47485b2e60035fc760372d59b6bb663e2c0d6a7).




[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671365814


   **[Test build #127281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127281/testReport)** for PR 29065 at commit [`94bfb36`](https://github.com/apache/spark/commit/94bfb36c4791772183a82cf4565fdea7ef7fb460).




[GitHub] [spark] wangyum closed pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #29065:
URL: https://github.com/apache/spark/pull/29065


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688664335


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128383/
   Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-664810904








[GitHub] [spark] maropu commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r499577826



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
##########
@@ -580,23 +581,11 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
     val singleCol = df.select(col)
     val colType = singleCol.schema.head.dataType
 
-    require(colType == StringType || colType.isInstanceOf[IntegralType],
-      s"Bloom filter only supports string type and integral types, but got $colType.")
-
-    val updater: (BloomFilter, InternalRow) => Unit = colType match {
-      // For string type, we can get bytes of our `UTF8String` directly, and call the `putBinary`
-      // instead of `putString` to avoid unnecessary conversion.
-      case StringType => (filter, row) => filter.putBinary(row.getUTF8String(0).getBytes)
-      case ByteType => (filter, row) => filter.putLong(row.getByte(0))
-      case ShortType => (filter, row) => filter.putLong(row.getShort(0))
-      case IntegerType => (filter, row) => filter.putLong(row.getInt(0))
-      case LongType => (filter, row) => filter.putLong(row.getLong(0))
-      case _ =>
-        throw new IllegalArgumentException(
-          s"Bloom filter only supports string type and integral types, " +
-            s"and does not support type $colType."
-        )
-    }
+    require(colType.isInstanceOf[AtomicType],
+      s"Bloom filter only supports atomic types, but got ${colType.catalogString}.")
+
+    val updater: (BloomFilter, InternalRow) => Unit =
+      (filter, row) => BloomFilterUtils.putValue(filter, row.get(0, colType))

Review comment:
       I think this change can cause a perf regression because the pattern matching on `colType` happens every time `updater` is called.
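
One possible way to avoid that cost, sketched here without Spark on the classpath: resolve the type dispatch once and return a closure specialized to the column type, as the removed code did. `ColType`, `ToyFilter` and `UpdaterSketch` are hypothetical stand-ins for Spark's `DataType`, `BloomFilter` and the PR's `BloomFilterUtils`; whether the per-row match is measurable in practice would need a benchmark.

```scala
import scala.collection.mutable

// Hypothetical stand-ins so the sketch runs without Spark.
sealed trait ColType
case object StringCol extends ColType
case object IntCol extends ColType
case object LongCol extends ColType

final class ToyFilter {
  private val seen = mutable.Set.empty[Any]
  def putLong(v: Long): Unit = seen += v
  def putBinary(v: Array[Byte]): Unit = seen += v.toSeq
  def size: Int = seen.size
}

object UpdaterSketch {
  // Dispatches on the column type for every value (the pattern questioned above).
  def putValue(filter: ToyFilter, colType: ColType, value: Any): Unit = colType match {
    case StringCol => filter.putBinary(value.asInstanceOf[String].getBytes("UTF-8"))
    case IntCol => filter.putLong(value.asInstanceOf[Int].toLong)
    case LongCol => filter.putLong(value.asInstanceOf[Long])
  }

  // Dispatches once and returns a closure specialized to the column type,
  // so the per-row call contains no pattern match.
  def makeUpdater(colType: ColType): (ToyFilter, Any) => Unit = colType match {
    case StringCol => (f, v) => f.putBinary(v.asInstanceOf[String].getBytes("UTF-8"))
    case IntCol => (f, v) => f.putLong(v.asInstanceOf[Int].toLong)
    case LongCol => (f, v) => f.putLong(v.asInstanceOf[Long])
  }
}
```

Both variants put the same values into the filter; the difference is only where the `match` is evaluated.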






[GitHub] [spark] maropu commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r499571886



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala
##########
@@ -66,8 +64,58 @@ case class DynamicPruningSubquery(
       buildKeys.forall(_.references.subsetOf(buildQuery.outputSet)) &&
       pruningKey.dataType == buildKeys(broadcastKeyIndex).dataType
   }
+}
+
+case class DynamicPartitionPruningSubquery(
+     pruningKey: Expression,
+     buildQuery: LogicalPlan,
+     buildKeys: Seq[Expression],
+     broadcastKeyIndex: Int,
+     onlyInBroadcast: Boolean,
+    override val exprId: ExprId = NamedExpression.newExprId)
+  extends DynamicPruningSubquery(
+    pruningKey, buildQuery, buildKeys, broadcastKeyIndex, onlyInBroadcast, exprId) {
+
+  override def children: Seq[Expression] = Seq(pruningKey)
+
+  override def plan: LogicalPlan = buildQuery
+
+  override def nullable: Boolean = false
+
+  override def withNewPlan(plan: LogicalPlan): DynamicPartitionPruningSubquery =
+    copy(buildQuery = plan)
+
+  override def toString: String = s"dynamicpartitionpruning#${exprId.id} $conditionString"
+
+  override lazy val canonicalized: DynamicPruning = {
+    copy(
+      pruningKey = pruningKey.canonicalized,
+      buildQuery = buildQuery.canonicalized,
+      buildKeys = buildKeys.map(_.canonicalized),
+      exprId = ExprId(0))
+  }
+}
+
+case class DynamicShufflePruningSubquery(

Review comment:
       It looks like `DynamicPartitionPruningSubquery` and `DynamicShufflePruningSubquery` are almost the same, so do we need this new predicate? Could we add a field to `DynamicPruningSubquery` that represents the pruning type, like this?
   ```
   case class DynamicPruningSubquery(
       pruningKey: Expression,
       buildQuery: LogicalPlan,
       buildKeys: Seq[Expression],
       broadcastKeyIndex: Int,
       onlyInBroadcast: Boolean,
       exprId: ExprId,
       pruningType: PruningType) <---- This?
   ```
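
To make the suggestion concrete, here is a small sketch of what such a pruning-type tag could look like. It is only an illustration: `PartitionPruningType`, `ShufflePruningType` and `PruningSubquerySketch` are hypothetical names, and Spark's `Expression` and `LogicalPlan` fields are replaced by plain strings so the snippet compiles on its own.

```scala
// Hypothetical pruning-type tag; the member names are illustrative only.
sealed trait PruningType
case object PartitionPruningType extends PruningType
case object ShufflePruningType extends PruningType

// Simplified stand-in for DynamicPruningSubquery with a single extra field
// instead of one subclass per pruning kind.
final case class PruningSubquerySketch(
    pruningKey: String,
    buildKeys: Seq[String],
    broadcastKeyIndex: Int,
    onlyInBroadcast: Boolean,
    pruningType: PruningType) {

  // Behaviour that differs between the two kinds is selected by the tag.
  override def toString: String = pruningType match {
    case PartitionPruningType => s"dynamicpartitionpruning($pruningKey)"
    case ShufflePruningType => s"dynamicshufflepruning($pruningKey)"
  }
}
```

Because `PruningType` is sealed, the compiler can warn about a `match` that forgets one of the kinds, which is one argument for a tag over two near-identical case classes.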






[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-664810432








[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671523141








[GitHub] [spark] maropu commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r499572935



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -231,18 +288,31 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
 
             // there should be a partitioned table and a filter on the dimension table,
             // otherwise the pruning will not trigger
-            var partScan = getPartitionTableScan(l, left)
-            if (partScan.isDefined && canPruneLeft(joinType) &&
-                hasPartitionPruningFilter(right)) {
-              val hasBenefit = pruningHasBenefit(l, partScan.get, r, right)
-              newLeft = insertPredicate(l, newLeft, r, right, rightKeys, hasBenefit)
-            } else {
-              partScan = getPartitionTableScan(r, right)
-              if (partScan.isDefined && canPruneRight(joinType) &&
-                  hasPartitionPruningFilter(left) ) {
-                val hasBenefit = pruningHasBenefit(r, partScan.get, l, left)
-                newRight = insertPredicate(r, newRight, l, left, leftKeys, hasBenefit)
-              }
+            // Left side
+            getPartitionTableScan(l, left) match {
+              // partition pruning
+              case Some(partScan) if canPruneLeft(joinType) && hasDynamicPruningFilter(right) =>
+                val hasBenefit = pruningHasBenefit(l, partScan, r, right)
+                newLeft = insertPartitionPredicate(l, newLeft, r, right, rightKeys, hasBenefit)
+              // shuffle pruning
+              case None if conf.dynamicShufflePruningEnabled && canPruneLeft(joinType) &&
+                hasDynamicPruningFilter(right) && isDataFilter(l, left) &&
+                shufflePruningHasBenefit(l, left, r, right) =>
+                newLeft = insertShufflePredicate(l, newLeft, r, right, rightKeys)
+              case _ =>
+            }
+            // Right side
+            getPartitionTableScan(r, right) match {
+              // partition pruning
+              case Some(partScan) if canPruneRight(joinType) && hasDynamicPruningFilter(left) =>
+                val hasBenefit = pruningHasBenefit(r, partScan, l, left)
+                newRight = insertPartitionPredicate(r, newRight, l, left, leftKeys, hasBenefit)
+              // shuffle pruning
+              case None if conf.dynamicShufflePruningEnabled && canPruneRight(joinType) &&

Review comment:
       Is this new feature enabled only if both `dynamicPartitionPruningEnabled` and `dynamicShufflePruningEnabled` are true?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656873810


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125627/
   Test FAILed.




[GitHub] [spark] github-actions[bot] commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-759864924


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] SparkQA removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688647269


   **[Test build #128383 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128383/testReport)** for PR 29065 at commit [`1a8cc9b`](https://github.com/apache/spark/commit/1a8cc9b2718c820df7962c82b7ac1007c7712b1c).




[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688664321








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688664321


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688647916








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671523141








[GitHub] [spark] SparkQA removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656756847


   **[Test build #125627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125627/testReport)** for PR 29065 at commit [`a47485b`](https://github.com/apache/spark/commit/a47485b2e60035fc760372d59b6bb663e2c0d6a7).




[GitHub] [spark] SparkQA removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-664810432








[GitHub] [spark] dongjoon-hyun commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671371190


   Hi, @wangyum. The doc and PR look reasonable. Is there a plan for further updates, given that the title still has `[WIP]`?




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r467918094



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala
##########
@@ -67,7 +67,58 @@ case class DynamicPruningSubquery(
       pruningKey.dataType == buildKeys(broadcastKeyIndex).dataType
   }
 
-  override def toString: String = s"dynamicpruning#${exprId.id} $conditionString"
+  override def toString: String = s"partitionpruning#${exprId.id} $conditionString"
+
+  override lazy val canonicalized: DynamicPruning = {
+    copy(
+      pruningKey = pruningKey.canonicalized,
+      buildQuery = buildQuery.canonicalized,
+      buildKeys = buildKeys.map(_.canonicalized),
+      exprId = ExprId(0))
+  }
+}
+
+/**
+ * The BloomFilterPruningSubquery expression is only used in join operations to prune one side of
+ * the join with a filter from the other side of the join. It is inserted in cases where shuffle
+ * pruning can be applied.
+ *
+ * @param pruningKey the filtering key of the plan to be pruned.
+ * @param buildQuery the build side of the join.
+ * @param buildKeys the join keys corresponding to the build side of the join
+ * @param broadcastKeyIndex the index of the filtering key collected from the broadcast
+ */
+case class BloomFilterPruningSubquery(
+    pruningKey: Expression,
+    buildQuery: LogicalPlan,
+    buildKeys: Seq[Expression],
+    broadcastKeyIndex: Int,
+    exprId: ExprId = NamedExpression.newExprId)
+  extends SubqueryExpression(buildQuery, Seq(pruningKey), exprId)
+    with DynamicPruning
+    with Unevaluable {

Review comment:
       https://github.com/databricks/scala-style-guide/blob/master/README.md#indent






[GitHub] [spark] jovany-wang commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
jovany-wang commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-662255897


   Hi @wangyum, this looks like a nice PR to me, but I'd like to raise a few points.
   
   I haven't benchmarked MinMax against Bloom filters, but my feeling is that each may affect performance differently depending on the case.
   So how about making this more general? For example:
   ```
                             DynamicFilter
                                   |
                  Is the filtering key partitioned?
                           /                  \
                         Y                     N
                        /                       \
                 DPP filter         Choose a best filter for it. (from MinMax, Bloom or other filters such as index filter, etc)
                                          Note: Not all of the filters can be pushed to scan.
   ```
   That is just a rough idea, but the key point is to make DynamicFilter (or RuntimeFilter) more general, so that MinMaxFilter, BloomFilter and DPPFilter are all DynamicFilters and new filters are easy to add.
   
   I have seen another proposal for a RuntimeFilter (MinMax) before, so ease of extension should matter as much as the perf result.
   
   Feel free to point out anything I have misunderstood, thanks.
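
The generalisation proposed in this comment could look roughly like the sketch below. All names are hypothetical and none of these types exist in Spark under these names; the choice rule is purely illustrative (exact key set for partition keys, min/max range otherwise), and a Bloom filter would be one more subclass of the same trait.

```scala
// Hypothetical names throughout; a sketch of the proposed hierarchy, not Spark code.
sealed trait DynamicFilter {
  // May return true for keys that are absent, but never false for keys that are present.
  def mightContain(key: Long): Boolean
}

// Exact set of build-side keys, as used when pruning partitions.
final case class DppFilter(keys: Set[Long]) extends DynamicFilter {
  def mightContain(key: Long): Boolean = keys.contains(key)
}

// Range filter: keeps any key within [min, max].
final case class MinMaxFilter(min: Long, max: Long) extends DynamicFilter {
  def mightContain(key: Long): Boolean = key >= min && key <= max
}

object DynamicFilters {
  // Illustrative choice rule only: partition keys get the exact filter,
  // other keys get a range filter; an empty build side yields no filter.
  def choose(isPartitionKey: Boolean, buildKeys: Seq[Long]): Option[DynamicFilter] =
    if (buildKeys.isEmpty) None
    else if (isPartitionKey) Some(DppFilter(buildKeys.toSet))
    else Some(MinMaxFilter(buildKeys.min, buildKeys.max))
}
```

Because callers only depend on `mightContain`, a new filter kind can be added without touching the code that applies the filter, which is the extensibility point the comment is after.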
   
   
   




[GitHub] [spark] dongjoon-hyun commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671361203


   Retest this please.




[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671362131








[GitHub] [spark] wangyum commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-899032854


   Some real cases from our cluster:
   Case 1:
   Before this PR | After this PR
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/129475907-45ea3aa1-7be8-4a14-b0c5-da0bf7814f6d.png) | ![image](https://user-images.githubusercontent.com/5399861/129475856-2aa57f31-f43f-41eb-a0fa-2282d221a084.png)
   
   Case 2:
   Before this PR | After this PR
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/129476066-6c3dff0f-e271-47aa-bd96-9d9ece59611f.png) | ![image](https://user-images.githubusercontent.com/5399861/129476160-c297f19c-f3e1-49db-a94c-9a8369c30283.png)
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671362131








[GitHub] [spark] jovany-wang edited a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
jovany-wang edited a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-662255897


   Hi @wangyum, this looks like a nice PR to me, but I'd like to raise a few points.
   
   I haven't benchmarked MinMax against Bloom filters, but my feeling is that each may affect performance differently depending on the case.
   So how about making this more general? For example:
   ```
                             DynamicFilter
                                   |
                  Is the filtering key partitioned?
                           /                  \
                         Y                     N
                        /                       \
                 DPP filter         Choose a best filter for it. (from MinMax, Bloom or other filters such as index filter, etc)
                                          Note: Not all of the filters can be pushed to scan.
   ```
   That is just a rough idea, but the key point is to make DynamicFilter (or RuntimeFilter) more general, so that MinMaxFilter, BloomFilter and DPPFilter are all DynamicFilters and new filters are easy to add.
   
   I have seen another proposal for a RuntimeFilter (MinMax) before, so ease of extension should matter as much as the perf result. Then again, it may be hard to make this more extensible.
   
   Feel free to point out anything I have misunderstood, thanks.
   
   
   




[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656873082


   **[Test build #125627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125627/testReport)** for PR 29065 at commit [`a47485b`](https://github.com/apache/spark/commit/a47485b2e60035fc760372d59b6bb663e2c0d6a7).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class RuntimeBloomFilterPruningSubquery(`
     * `case class BuildBloomFilter(`
     * `case class InBloomFilter(bloomFilterExp: Expression, value: Expression)`
     * `case class BloomFilterSubqueryExec(`
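
For readers unfamiliar with the technique behind the class names above, here is a toy sketch of building a Bloom filter from one side's join keys and using it to pre-filter the other side. It is not the PR's implementation: `ToyBloomFilter`, `BloomJoinSketch`, the bit count and the hash count are all hypothetical choices made for illustration.

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter over Long keys; hypothetical and unrelated to the PR's classes.
final class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // Derive the i-th bit position by hashing the key with seed i.
  private def position(key: Long, i: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key.toString, i), numBits)

  // Set every bit position derived from the key.
  def put(key: Long): Unit =
    (0 until numHashes).foreach(i => bits.set(position(key, i)))

  // True if all derived bits are set: false positives are possible,
  // false negatives are not.
  def mightContain(key: Long): Boolean =
    (0 until numHashes).forall(i => bits.get(position(key, i)))
}

object BloomJoinSketch {
  // Build a filter from the small side's join keys, then drop probe-side
  // keys that cannot match before they would be shuffled.
  def preFilter(buildKeys: Seq[Long], probeKeys: Seq[Long]): Seq[Long] = {
    val filter = new ToyBloomFilter(numBits = 1 << 16, numHashes = 3)
    buildKeys.foreach(filter.put)
    probeKeys.filter(filter.mightContain)
  }
}
```

Every probe key that has a match on the build side survives the pre-filter; a small number of non-matching keys may also survive and are then discarded by the join itself.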




[GitHub] [spark] SparkQA commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-688664173


   **[Test build #128383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128383/testReport)** for PR 29065 at commit [`1a8cc9b`](https://github.com/apache/spark/commit/1a8cc9b2718c820df7962c82b7ac1007c7712b1c).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maropu commented on a change in pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29065:
URL: https://github.com/apache/spark/pull/29065#discussion_r499574852



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala
##########
@@ -73,18 +90,60 @@ case class PlanDynamicPruningFilters(sparkSession: SparkSession)
           val exchange = BroadcastExchangeExec(mode, executedPlan)
           val name = s"dynamicpruning#${exprId.id}"
           // place the broadcast adaptor for reusing the broadcast results on the probe side
-          val broadcastValues =
-            SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange)
-          DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId))
+          val broadcastValues = SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange)
+          if (preferBloomFilter(buildKeys(broadcastKeyIndex), buildPlan)) {
+            DynamicPruningExpression(BloomFilterSubqueryExec(value, broadcastValues, exprId))

Review comment:
       Does this PR propose two things: 1. improving the existing partition pruning with Bloom filters, and 2. implementing a new dynamic pruning strategy (shuffle pruning)?






[GitHub] [spark] AmplabJenkins commented on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-656873796








[GitHub] [spark] github-actions[bot] closed pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29065:
URL: https://github.com/apache/spark/pull/29065


   




[GitHub] [spark] SparkQA removed a comment on pull request #29065: [WIP][SPARK-32268][SQL] Bloom Filter Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29065:
URL: https://github.com/apache/spark/pull/29065#issuecomment-671365814


   **[Test build #127281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127281/testReport)** for PR 29065 at commit [`94bfb36`](https://github.com/apache/spark/commit/94bfb36c4791772183a82cf4565fdea7ef7fb460).

