You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Victor Delépine (Jira)" <ji...@apache.org> on 2022/07/12 09:23:00 UTC

[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation

    [ https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565387#comment-17565387 ] 

Victor Delépine commented on SPARK-39753:
-----------------------------------------

cc [~ndimiduk] since you created the original ticket 

> Broadcast joins should pushdown join constraints as Filter to the larger relation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-39753
>                 URL: https://issues.apache.org/jira/browse/SPARK-39753
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0
>            Reporter: Victor Delépine
>            Priority: Major
>
> SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to re-open it here for more visibility, since I believe this bug has a major impact and that fixing it could drastically improve the performance of many pipelines.
> Allow me to paste the initial description again here:
> _For broadcast inner-joins, where the smaller relation is known to be small enough to materialize on a worker, the set of values for all join columns is known and fits in memory. Spark should translate these values into a {{Filter}} pushed down to the datasource. The common join condition of equality, i.e. {{{}lhs.a == rhs.a{}}}, can be written as an {{a in ...}} clause. An example of pushing such filters is already present in the form of {{IsNotNull}} filters via_ [~sameerag]{_}'s work on SPARK-12957 subtasks.{_}
> _This optimization could even work when the smaller relation does not fit entirely in memory. This could be done by partitioning the smaller relation into N pieces, applying this predicate pushdown for each piece, and unioning the results._
>  
> Essentially, when doing a Broadcast join, the smaller side can be used to filter down the bigger side before performing the join. As of today, the join will reads all partitions of the bigger side, without pruning partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org