You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/18 05:28:42 UTC

[GitHub] [spark] wangyum opened a new pull request, #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

wangyum opened a new pull request, #36588:
URL: https://github.com/apache/spark/pull/36588

   ### What changes were proposed in this pull request?
   
   Makes DPP support the pruning side has `Union`. For example:
   ```sql
   SELECT f.store_id,
          f.date_id,
          s.state_province
   FROM (SELECT 4 AS store_id,
                  date_id,
                  product_id
         FROM   fact_sk
         WHERE  date_id >= 1300
         UNION ALL
         SELECT   store_id,
                  date_id,
                  product_id
         FROM   fact_stats
         WHERE  date_id <= 1000) f
   JOIN dim_store s
   ON f.store_id = s.store_id
   WHERE s.country IN ('US', 'NL')
   ```
   After this PR:
   ![image](https://user-images.githubusercontent.com/5399861/168963668-1e5281ed-925b-4b5f-b701-020930c4336b.png)
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1132514681

   A case from production:
   ![image](https://user-images.githubusercontent.com/5399861/169463931-65bfd0c0-1759-4f9d-8a0a-66b32463b76a.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r957919237


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/CleanupDynamicPruningFilters.scala:
##########
@@ -54,7 +54,8 @@ object CleanupDynamicPruningFilters extends Rule[LogicalPlan] with PredicateHelp
         val newCondition = condition.transformWithPruning(
           _.containsPattern(DYNAMIC_PRUNING_SUBQUERY)) {
           case dynamicPruning: DynamicPruningSubquery
-              if unnecessaryPruningKeys.contains(dynamicPruning.pruningKey) =>
+              if unnecessaryPruningKeys.contains(dynamicPruning.pruningKey) ||
+                dynamicPruning.pruningKey.references.isEmpty =>

Review Comment:
   how is this change related to Union?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r957965302


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/CleanupDynamicPruningFilters.scala:
##########
@@ -54,7 +54,8 @@ object CleanupDynamicPruningFilters extends Rule[LogicalPlan] with PredicateHelp
         val newCondition = condition.transformWithPruning(
           _.containsPattern(DYNAMIC_PRUNING_SUBQUERY)) {
           case dynamicPruning: DynamicPruningSubquery
-              if unnecessaryPruningKeys.contains(dynamicPruning.pruningKey) =>
+              if unnecessaryPruningKeys.contains(dynamicPruning.pruningKey) ||
+                dynamicPruning.pruningKey.references.isEmpty =>

Review Comment:
   This issue has fixed: https://github.com/apache/spark/pull/36724



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958108361


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   and we only add DPP filter in one child of Union?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1231035576

   sorry for the late review, the change looks reasonable to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1130002267

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union
URL: https://github.com/apache/spark/pull/36588


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958046356


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   can you explain a bit more about this sorting part?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1374666826

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958150650


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   After reading the code more, I don't think we have to use a heuristic here. We can let caller side to pass a callback function so that we can go to all the branches to find partitioned tables.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   After reading the code more, I don't think we have to use a heuristic here. We can let the caller side pass a callback function so that we can go to all the branches to find partitioned tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958080120


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   In general, tables with larger sizes are more likely to be partition tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958114095


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   For all children of Union expect the references is empty. For example: https://github.com/apache/spark/pull/36724



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958157620


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958147822


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   shall we do the same for `case other =>`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1229596412

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36588:
URL: https://github.com/apache/spark/pull/36588#discussion_r958114095


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -140,6 +140,15 @@ trait PredicateHelper extends AliasHelper with Logging {
         findExpressionAndTrackLineageDown(replaceAlias(exp, aliasMap), a.child)
       case l: LeafNode if exp.references.subsetOf(l.outputSet) =>
         Some((exp, l))
+      case u: Union =>
+        val index = u.output.indexWhere(_.semanticEquals(exp))
+        if (index > -1) {
+          u.children
+            .flatMap(child => findExpressionAndTrackLineageDown(child.output(index), child))
+            .sortBy(_._2.stats.sizeInBytes).lastOption

Review Comment:
   For all child of Union expect the references is empty. For example: https://github.com/apache/spark/pull/36724



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union
URL: https://github.com/apache/spark/pull/36588


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36588: [SPARK-39217][SQL] Makes DPP support the pruning side has Union

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36588:
URL: https://github.com/apache/spark/pull/36588#issuecomment-1261925436

   @cloud-fan Do you have more comments?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org