Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/02 02:36:57 UTC

[GitHub] [spark] wangyum opened a new pull request, #38071: [SPARK-36290][SQL] Pull out complex join condition

wangyum opened a new pull request, #38071:
URL: https://github.com/apache/spark/pull/38071

   ### What changes were proposed in this pull request?
   
   Similar to [`PullOutGroupingExpressions`](https://github.com/wangyum/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48), this PR adds a new rule (`PullOutComplexJoinCondition`) that pulls complex expressions out of join conditions into a `Project` below the join, so that the join condition itself only references attributes.
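   
   The rule's implementation is not included in this description. As a rough illustration, here is a minimal Catalyst-style sketch of the transform; the object name, the `_leftJoinKey`/`_rightJoinKey` alias names, and the "anything that is not a bare attribute" complexity check are assumptions for illustration, not the PR's actual code:
   
   ```scala
   import scala.collection.mutable.ArrayBuffer
   
   import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, EqualTo, Expression}
   import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Project}
   import org.apache.spark.sql.catalyst.rules.Rule
   
   object PullOutComplexJoinConditionSketch extends Rule[LogicalPlan] {
   
     // Assumption: a join key is "complex" when it is not already a bare attribute.
     private def isComplex(e: Expression): Boolean = !e.isInstanceOf[Attribute]
   
     override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
       case j @ Join(left, right, _, Some(condition), _) =>
         val leftKeys = ArrayBuffer.empty[Alias]
         val rightKeys = ArrayBuffer.empty[Alias]
   
         // Replace a complex, deterministic key that only needs one side's
         // columns with an attribute pointing at a precomputed alias.
         def pullOut(e: Expression): Expression =
           if (isComplex(e) && e.deterministic && e.references.nonEmpty) {
             if (e.references.subsetOf(left.outputSet)) {
               val a = Alias(e, "_leftJoinKey")(); leftKeys += a; a.toAttribute
             } else if (e.references.subsetOf(right.outputSet)) {
               val a = Alias(e, "_rightJoinKey")(); rightKeys += a; a.toAttribute
             } else e
           } else e
   
         val newCondition = condition.transform {
           case EqualTo(l, r) => EqualTo(pullOut(l), pullOut(r))
         }
   
         if (leftKeys.isEmpty && rightKeys.isEmpty) {
           j
         } else {
           // Each complex key is now evaluated once, in a Project below the
           // join; a Project on top restores the join's original output.
           val newLeft = if (leftKeys.isEmpty) left else Project(left.output ++ leftKeys, left)
           val newRight = if (rightKeys.isEmpty) right else Project(right.output ++ rightKeys, right)
           Project(j.output, j.copy(left = newLeft, right = newRight, condition = Some(newCondition)))
         }
     }
   }
   ```
   
   Because the key is precomputed in a `Project`, downstream operators (Exchange, Sort, the join comparison) see a plain attribute instead of re-evaluating the expression.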
   
   ### Why are the changes needed?
   
   Pulling out complex join conditions has the following advantages:
   1. Reduces the number of times a complex expression is evaluated from 3 to 2 for a SortMergeJoin.
   2. Infers additional filters, which can sometimes avoid data skew. For example: https://github.com/apache/spark/pull/28642
   3. Spares other rules from having to handle complex conditions themselves. For example: https://github.com/apache/spark/blob/dee7396204e2f6e7346e220867953fc74cd4253d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushPartialAggregationThroughJoin.scala#L325-L327
   
   For example:
   ```sql
   CREATE TABLE t1 (item_id BIGINT, event_type STRING, dt STRING) USING parquet PARTITIONED BY (dt);
   CREATE TABLE t2 (item_id BIGINT, cal_dt DATE) USING parquet;
   SET spark.sql.autoBroadcastJoinThreshold=-1;
   
   SELECT a.item_id,
          a.event_type
   FROM   t1 a
          INNER JOIN (SELECT DISTINCT cal_dt,
                                      item_id
                      FROM   t2) b
                  ON a.item_id = b.item_id
                     AND To_date(a.dt, 'yyyyMMdd') = b.cal_dt
   WHERE To_date(a.dt, 'yyyyMMdd') = date '2022-10-01';
   ```
   
   Before this PR:
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Project [item_id#28L, event_type#29]
      +- SortMergeJoin [cast(gettimestamp(dt#30, yyyyMMdd, TimestampType, Some(America/Los_Angeles), false) as date), item_id#28L], [cal_dt#32, item_id#31L], Inner
         :- Sort [cast(gettimestamp(dt#30, yyyyMMdd, TimestampType, Some(America/Los_Angeles), false) as date) ASC NULLS FIRST, item_id#28L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(cast(gettimestamp(dt#30, yyyyMMdd, TimestampType, Some(America/Los_Angeles), false) as date), item_id#28L, 5), ENSURE_REQUIREMENTS, [plan_id=78]
         :     +- Filter isnotnull(item_id#28L)
         :        +- FileScan parquet spark_catalog.default.t1[item_id#28L,event_type#29,dt#30]
         +- Sort [cal_dt#32 ASC NULLS FIRST, item_id#31L ASC NULLS FIRST], false, 0
            +- HashAggregate(keys=[cal_dt#32, item_id#31L], functions=[])
               +- Exchange hashpartitioning(cal_dt#32, item_id#31L, 5), ENSURE_REQUIREMENTS, [plan_id=74]
                  +- HashAggregate(keys=[cal_dt#32, item_id#31L], functions=[])
                     +- Filter (isnotnull(item_id#31L) AND isnotnull(cal_dt#32))
                        +- FileScan parquet spark_catalog.default.t2[item_id#31L,cal_dt#32]
   ```
   
   After this PR, the complex key is computed once in a `Project` and the join uses the resulting attribute; note also the filter `(cal_dt#32 = 2022-10-01)` now inferred on the `t2` side:
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Project [item_id#28L, event_type#29]
      +- SortMergeJoin [CAST(gettimestamp(a.#37, item_id#28L], [cal_dt#32, item_id#31L], Inner
         :- Sort [CAST(gettimestamp(a.#37 ASC NULLS FIRST, item_id#28L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(CAST(gettimestamp(a.#37, item_id#28L, 5), ENSURE_REQUIREMENTS, [plan_id=78]
         :     +- Project [item_id#28L, event_type#29, cast(gettimestamp(dt#30, yyyyMMdd, TimestampType, Some(America/Los_Angeles), false) as date) AS CAST(gettimestamp(a.#37]
         :        +- Filter isnotnull(item_id#28L)
         :           +- FileScan parquet spark_catalog.default.t1[item_id#28L,event_type#29,dt#30]
         +- Sort [cal_dt#32 ASC NULLS FIRST, item_id#31L ASC NULLS FIRST], false, 0
            +- HashAggregate(keys=[cal_dt#32, item_id#31L], functions=[])
               +- Exchange hashpartitioning(cal_dt#32, item_id#31L, 5), ENSURE_REQUIREMENTS, [plan_id=74]
                  +- HashAggregate(keys=[cal_dt#32, item_id#31L], functions=[])
                     +- Filter (((cal_dt#32 = 2022-10-01) AND isnotnull(item_id#31L)) AND isnotnull(cal_dt#32))
                        +- FileScan parquet spark_catalog.default.t2[item_id#31L,cal_dt#32]
   ```
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
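   
   The test code is not included in this description. As a rough illustration, a Catalyst optimizer-rule suite exercising the sketch above would typically follow the usual `PlanTest` + one-batch `RuleExecutor` + `comparePlans` pattern; the relation, column, and alias names below are made up:
   
   ```scala
   import org.apache.spark.sql.catalyst.dsl.expressions._
   import org.apache.spark.sql.catalyst.dsl.plans._
   import org.apache.spark.sql.catalyst.expressions.Upper
   import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
   import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
   import org.apache.spark.sql.catalyst.rules.RuleExecutor
   
   class PullOutComplexJoinConditionSuite extends PlanTest {
   
     object Optimize extends RuleExecutor[LogicalPlan] {
       val batches = Batch("PullOutComplexJoinCondition", Once,
         PullOutComplexJoinConditionSketch) :: Nil
     }
   
     test("pull out a complex left-side join key") {
       val t1 = LocalRelation($"a".string)
       val t2 = LocalRelation($"b".string)
   
       // Upper(a) stands in for any non-attribute join key, e.g. to_date(dt, ...).
       val query = t1.join(t2, Inner, Some(Upper($"a") === $"b")).analyze
   
       val key = Upper($"a").as("_leftJoinKey")
       val correctAnswer = t1
         .select($"a", key)
         .join(t2, Inner, Some(key.toAttribute === $"b"))
         .select($"a", $"b")
         .analyze
   
       comparePlans(Optimize.execute(query), correctAnswer)
     }
   }
   ```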




[GitHub] [spark] github-actions[bot] commented on pull request #38071: [SPARK-36290][SQL] Pull out complex join condition

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38071:
URL: https://github.com/apache/spark/pull/38071#issuecomment-1427177392

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] wangyum commented on pull request #38071: [SPARK-36290][SQL] Pull out complex join condition

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #38071:
URL: https://github.com/apache/spark/pull/38071#issuecomment-1302892265

   Case from production:
   
   Before | After
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/199874725-8e8b3111-bf57-47fe-b123-b95fe1f0ed01.png) | ![image](https://user-images.githubusercontent.com/5399861/199874933-d597101e-7cce-4292-b715-a0ea2218a563.png)
   
   




[GitHub] [spark] github-actions[bot] closed pull request #38071: [SPARK-36290][SQL] Pull out complex join condition

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38071: [SPARK-36290][SQL] Pull out complex join condition
URL: https://github.com/apache/spark/pull/38071

