Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/25 11:43:03 UTC

[GitHub] [spark] beliefer opened a new pull request, #38388: [WIP][SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

beliefer opened a new pull request, #38388:
URL: https://github.com/apache/spark/pull/38388

   ### What changes were proposed in this pull request?
   Currently, if the creation side of the bloom filter can be broadcast, Spark does not inject a bloom filter or `InSubquery` filter into the application side. I think the original assumption was that a broadcast is cheaper than a runtime filter. This behavior is inconsistent with the case described in the design document, in which the join has a shuffle below the broadcast hash join.
   ![image](https://user-images.githubusercontent.com/8486025/194822331-7a36b018-dfd9-48ce-a867-7c30a6914791.png)
   
   In fact, we can inject a bloom filter that reuses the broadcast exchange, which could improve performance.
   
   The bloom filter path already ensures that the creation side is small enough with the code below.
   ```
       // Skip if the filter creation side is too big
       if (filterCreationSidePlan.stats.sizeInBytes > conf.runtimeFilterCreationSideThreshold) {
         return filterApplicationSidePlan
       }
   ```
   The `InSubquery` filter path ensures that the creation side is small enough with the code below.
   ```
       val aggregate = Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan)
       if (!canBroadcastBySize(aggregate, conf)) {
         // Skip the InSubquery filter if the size of `aggregate` is beyond broadcast join threshold,
         // i.e., the semi-join will be a shuffled join, which is not worthwhile.
         return filterApplicationSidePlan
       }
   ```
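
   The two guards above can be modeled outside of Spark as a pair of size checks. The following is only a minimal sketch for illustration: `PlanStats`, `FilterConf`, and `RuntimeFilterGuards` are hypothetical names that do not exist in Spark, and the body of `allowInSubquery` is an assumption about what `canBroadcastBySize` checks (a non-negative size that does not exceed the broadcast join threshold).

   ```scala
   // Hypothetical stand-in for a logical plan: only the estimated size matters here.
   final case class PlanStats(sizeInBytes: BigInt)

   // Hypothetical holder for the two thresholds; in Spark they are read from the SQL conf.
   final case class FilterConf(
       runtimeFilterCreationSideThreshold: BigInt,
       autoBroadcastJoinThreshold: BigInt)

   object RuntimeFilterGuards {
     // Mirrors the bloom filter guard: the filter is skipped when the
     // creation side is larger than the creation-side threshold.
     def allowBloomFilter(creationSide: PlanStats, conf: FilterConf): Boolean =
       creationSide.sizeInBytes <= conf.runtimeFilterCreationSideThreshold

     // Mirrors the InSubquery guard (assumed semantics): the aggregated
     // creation side must be small enough to be broadcast.
     def allowInSubquery(aggregated: PlanStats, conf: FilterConf): Boolean =
       aggregated.sizeInBytes >= 0 &&
         aggregated.sizeInBytes <= conf.autoBroadcastJoinThreshold
   }
   ```

   With both thresholds set to an illustrative 10 MiB, a 1 KiB creation side passes both checks and a 64 MiB creation side fails both. In other words, both filters already require a small creation side independently of the broadcast restriction discussed in this PR.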
   ### Why are the changes needed?
   
   1. Relax the broadcast join restriction on the bloom filter so that the runtime filter applies to more scenarios.
   2. Reuse the broadcast exchange for bloom filter.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   This only updates the internal implementation.
   
   
   ### How was this patch tested?
   New tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #38388: [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38388: [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter
URL: https://github.com/apache/spark/pull/38388




[GitHub] [spark] beliefer commented on pull request #38388: [WIP][SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1291771515

   ping @somani cc @cloud-fan




[GitHub] [spark] chenminghua8 commented on pull request #38388: [WIP][SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by GitBox <gi...@apache.org>.
chenminghua8 commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1290458599

   I also think this pull request is necessary. Other conditions that prevent the bloom filter or `InSubquery` filter from being injected should probably be adjusted as well. For example, once a bloom filter or `InSubquery` filter has been injected on one side of the join, no bloom filter or `InSubquery` filter can be injected on the other side.




[GitHub] [spark] github-actions[bot] commented on pull request #38388: [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1528287576

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




Re: [PR] [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1815534318

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] beliefer commented on pull request #38388: [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1296429564

   cc @maryannxue @sigmod 




[GitHub] [spark] beliefer commented on pull request #38388: [WIP][SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #38388:
URL: https://github.com/apache/spark/pull/38388#issuecomment-1291770897

   The failed test case is unrelated to this PR.




Re: [PR] [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38388: [SPARK-40909][SQL] Reuse the broadcast exchange for bloom filter
URL: https://github.com/apache/spark/pull/38388

