You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/07 00:21:34 UTC

[GitHub] [spark] wangyum commented on pull request #28741: [WIP][SPARK-31919][SQL] Push down more predicates through Join

wangyum commented on pull request #28741:
URL: https://github.com/apache/spark/pull/28741#issuecomment-640136037


   It seems this solution cannot be fully optimized.
   PostgreSQL:
   ```
    Aggregate  (cost=77.33..77.34 rows=1 width=128)
      ->  Nested Loop  (cost=0.00..77.31 rows=1 width=32)
            Join Filter: (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
            ->  Nested Loop  (cost=0.00..67.18 rows=1 width=36)
                  Join Filter: ((store_sales.ss_addr_sk = customer_address.ca_address_sk) AND ((((customer_address.ca_state)::text = ANY ('{TX,OH,TX}'::text[])) AND (store_sales.ss_net_profit >=
   '100'::numeric) AND (store_sales.ss_net_profit <= '200'::numeric)) OR (((customer_address.ca_state)::text = ANY ('{OR,NM,KY}'::text[])) AND (store_sales.ss_net_profit >= '150'::numeric) AND (s
   tore_sales.ss_net_profit <= '300'::numeric)) OR (((customer_address.ca_state)::text = ANY ('{VA,TX,MS}'::text[])) AND (store_sales.ss_net_profit >= '50'::numeric) AND (store_sales.ss_net_profi
   t <= '250'::numeric))))
                  ->  Nested Loop  (cost=0.00..56.90 rows=1 width=54)
                        Join Filter: ((store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk) AND ((((customer_demographics.cd_marital_status)::text = 'M'::text) AND ((customer_demographics.
   cd_education_status)::text = 'Advanced Degree'::text) AND (store_sales.ss_sales_price >= 100.00) AND (store_sales.ss_sales_price <= 150.00) AND (household_demographics.hd_dep_count = 3)) OR ((
   (customer_demographics.cd_marital_status)::text = 'S'::text) AND ((customer_demographics.cd_education_status)::text = 'College'::text) AND (store_sales.ss_sales_price >= 50.00) AND (store_sale
   s.ss_sales_price <= 100.00) AND (household_demographics.hd_dep_count = 1)) OR (((customer_demographics.cd_marital_status)::text = 'W'::text) AND ((customer_demographics.cd_education_status)::t
   ext = '2 yr Degree'::text) AND (store_sales.ss_sales_price >= 150.00) AND (store_sales.ss_sales_price <= 200.00) AND (household_demographics.hd_dep_count = 1))))
                        ->  Nested Loop  (cost=0.00..46.10 rows=1 width=76)
                              Join Filter: (store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk)
                              ->  Nested Loop  (cost=0.00..33.61 rows=1 width=76)
                                    Join Filter: (store_sales.ss_store_sk = store.s_store_sk)
                                    ->  Seq Scan on store_sales  (cost=0.00..23.60 rows=1 width=80)
                                          Filter: ((((ss_sales_price >= 100.00) AND (ss_sales_price <= 150.00)) OR ((ss_sales_price >= 50.00) AND (ss_sales_price <= 100.00)) OR ((ss_sales_price >
   = 150.00) AND (ss_sales_price <= 200.00))) AND (((ss_net_profit >= '100'::numeric) AND (ss_net_profit <= '200'::numeric)) OR ((ss_net_profit >= '150'::numeric) AND (ss_net_profit <= '300'::num
   eric)) OR ((ss_net_profit >= '50'::numeric) AND (ss_net_profit <= '250'::numeric))))
                                    ->  Seq Scan on store  (cost=0.00..10.00 rows=1 width=4)
                              ->  Seq Scan on household_demographics  (cost=0.00..12.45 rows=3 width=8)
                                    Filter: ((hd_dep_count = 3) OR (hd_dep_count = 1) OR (hd_dep_count = 1))
                        ->  Seq Scan on customer_demographics  (cost=0.00..10.75 rows=1 width=1036)
                              Filter: ((((cd_marital_status)::text = 'M'::text) AND ((cd_education_status)::text = 'Advanced Degree'::text)) OR (((cd_marital_status)::text = 'S'::text) AND ((cd_e
   ducation_status)::text = 'College'::text)) OR (((cd_marital_status)::text = 'W'::text) AND ((cd_education_status)::text = '2 yr Degree'::text)))
                  ->  Seq Scan on customer_address  (cost=0.00..10.24 rows=1 width=520)
                        Filter: (((ca_country)::text = 'United States'::text) AND (((ca_state)::text = ANY ('{TX,OH,TX}'::text[])) OR ((ca_state)::text = ANY ('{OR,NM,KY}'::text[])) OR ((ca_state
   )::text = ANY ('{VA,TX,MS}'::text[]))))
            ->  Seq Scan on date_dim  (cost=0.00..10.12 rows=1 width=4)
                  Filter: (d_year = 2001)
   (22 rows)
   ```
   
   After this PR(`set spark.sql.constraintPropagation.enabled=false` to ignore infer `IsNotNull`):
   ```
   *(7) HashAggregate(keys=[], functions=[avg(cast(ss_quantity#10 as bigint)), avg(UnscaledValue(ss_ext_sales_price#15)), avg(UnscaledValue(ss_ext_wholesale_cost#16)), sum(UnscaledValue(ss_ext_wholesale_cost#16))])
   +- Exchange SinglePartition, true, [id=#252]
      +- *(6) HashAggregate(keys=[], functions=[partial_avg(cast(ss_quantity#10 as bigint)), partial_avg(UnscaledValue(ss_ext_sales_price#15)), partial_avg(UnscaledValue(ss_ext_wholesale_cost#16)), partial_sum(UnscaledValue(ss_ext_wholesale_cost#16))])
         +- *(6) Project [ss_quantity#10, ss_ext_sales_price#15, ss_ext_wholesale_cost#16]
            +- *(6) BroadcastHashJoin [ss_hdemo_sk#5], [hd_demo_sk#61], Inner, BuildRight, (((((((cd_marital_status#54 = M) AND (cd_education_status#55 = Advanced Degree)) AND (ss_sales_price#13 >= 100.00)) AND (ss_sales_price#13 <= 150.00)) AND (hd_dep_count#64 = 3)) OR (((((cd_marital_status#54 = S) AND (cd_education_status#55 = College)) AND (ss_sales_price#13 >= 50.00)) AND (ss_sales_price#13 <= 100.00)) AND (hd_dep_count#64 = 1))) OR (((((cd_marital_status#54 = W) AND (cd_education_status#55 = 2 yr Degree)) AND (ss_sales_price#13 >= 150.00)) AND (ss_sales_price#13 <= 200.00)) AND (hd_dep_count#64 = 1)))
               :- *(6) Project [ss_hdemo_sk#5, ss_quantity#10, ss_sales_price#13, ss_ext_sales_price#15, ss_ext_wholesale_cost#16, cd_marital_status#54, cd_education_status#55]
               :  +- *(6) BroadcastHashJoin [ss_cdemo_sk#4], [cd_demo_sk#52], Inner, BuildRight, ((((((cd_marital_status#54 = M) AND (cd_education_status#55 = Advanced Degree)) AND (ss_sales_price#13 >= 100.00)) AND (ss_sales_price#13 <= 150.00)) OR ((((cd_marital_status#54 = S) AND (cd_education_status#55 = College)) AND (ss_sales_price#13 >= 50.00)) AND (ss_sales_price#13 <= 100.00))) OR ((((cd_marital_status#54 = W) AND (cd_education_status#55 = 2 yr Degree)) AND (ss_sales_price#13 >= 150.00)) AND (ss_sales_price#13 <= 200.00)))
               :     :- *(6) Project [ss_cdemo_sk#4, ss_hdemo_sk#5, ss_quantity#10, ss_sales_price#13, ss_ext_sales_price#15, ss_ext_wholesale_cost#16]
               :     :  +- *(6) BroadcastHashJoin [ss_sold_date_sk#0], [d_date_sk#79], Inner, BuildRight
               :     :     :- *(6) Project [ss_sold_date_sk#0, ss_cdemo_sk#4, ss_hdemo_sk#5, ss_quantity#10, ss_sales_price#13, ss_ext_sales_price#15, ss_ext_wholesale_cost#16]
               :     :     :  +- *(6) BroadcastHashJoin [ss_addr_sk#6], [ca_address_sk#66], Inner, BuildRight, ((((ca_state#74 IN (TX,OH) AND (ss_net_profit#22 >= 100.00)) AND (ss_net_profit#22 <= 200.00)) OR ((ca_state#74 IN (OR,NM,KY) AND (ss_net_profit#22 >= 150.00)) AND (ss_net_profit#22 <= 300.00))) OR ((ca_state#74 IN (VA,TX,MS) AND (ss_net_profit#22 >= 50.00)) AND (ss_net_profit#22 <= 250.00)))
               :     :     :     :- *(6) Project [ss_sold_date_sk#0, ss_cdemo_sk#4, ss_hdemo_sk#5, ss_addr_sk#6, ss_quantity#10, ss_sales_price#13, ss_ext_sales_price#15, ss_ext_wholesale_cost#16, ss_net_profit#22]
               :     :     :     :  +- *(6) BroadcastHashJoin [ss_store_sk#7], [s_store_sk#23], Inner, BuildRight
               :     :     :     :     :- *(6) Filter ((((ss_net_profit#22 >= 100.00) AND (ss_net_profit#22 <= 200.00)) OR ((ss_net_profit#22 >= 150.00) AND (ss_net_profit#22 <= 300.00))) OR ((ss_net_profit#22 >= 50.00) AND (ss_net_profit#22 <= 250.00)))
               :     :     :     :     :  +- *(6) ColumnarToRow
               :     :     :     :     :     +- FileScan parquet default.store_sales[ss_sold_date_sk#0,ss_cdemo_sk#4,ss_hdemo_sk#5,ss_addr_sk#6,ss_store_sk#7,ss_quantity#10,ss_sales_price#13,ss_ext_sales_price#15,ss_ext_wholesale_cost#16,ss_net_profit#22] Batched: true, DataFilters: [((((ss_net_profit#22 >= 100.00) AND (ss_net_profit#22 <= 200.00)) OR ((ss_net_profit#22 >= 150.0..., Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [Or(Or(And(GreaterThanOrEqual(ss_net_profit,100.00),LessThanOrEqual(ss_net_profit,200.00)),And(Gr..., ReadSchema: struct<ss_sold_date_sk:int,ss_cdemo_sk:int,ss_hdemo_sk:int,ss_addr_sk:int,ss_store_sk:int,ss_quan...
               :     :     :     :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#213]
               :     :     :     :        +- *(1) ColumnarToRow
               :     :     :     :           +- FileScan parquet default.store[s_store_sk#23] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s_store_sk:int>
               :     :     :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#222]
               :     :     :        +- *(2) Project [ca_address_sk#66, ca_state#74]
               :     :     :           +- *(2) Filter ((ca_country#76 = United States) AND ((ca_state#74 IN (TX,OH) OR ca_state#74 IN (OR,NM,KY)) OR ca_state#74 IN (VA,TX,MS)))
               :     :     :              +- *(2) ColumnarToRow
               :     :     :                 +- FileScan parquet default.customer_address[ca_address_sk#66,ca_state#74,ca_country#76] Batched: true, DataFilters: [(ca_country#76 = United States), ((ca_state#74 IN (TX,OH) OR ca_state#74 IN (OR,NM,KY)) OR ca_st..., Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [EqualTo(ca_country,United States), Or(Or(In(ca_state, [TX,OH]),In(ca_state, [OR,NM,KY])),In(ca_s..., ReadSchema: struct<ca_address_sk:int,ca_state:string,ca_country:string>
               :     :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#231]
               :     :        +- *(3) Project [d_date_sk#79]
               :     :           +- *(3) Filter (d_year#85 = 2001)
               :     :              +- *(3) ColumnarToRow
               :     :                 +- FileScan parquet default.date_dim[d_date_sk#79,d_year#85] Batched: true, DataFilters: [(d_year#85 = 2001)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [EqualTo(d_year,2001)], ReadSchema: struct<d_date_sk:int,d_year:int>
               :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#238]
               :        +- *(4) ColumnarToRow
               :           +- FileScan parquet default.customer_demographics[cd_demo_sk#52,cd_marital_status#54,cd_education_status#55] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<cd_demo_sk:int,cd_marital_status:string,cd_education_status:string>
               +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#246]
                  +- *(5) Filter (((hd_dep_count#64 = 3) OR (hd_dep_count#64 = 1)) OR (hd_dep_count#64 = 1))
                     +- *(5) ColumnarToRow
                        +- FileScan parquet default.household_demographics[hd_demo_sk#61,hd_dep_count#64] Batched: true, DataFilters: [(((hd_dep_count#64 = 3) OR (hd_dep_count#64 = 1)) OR (hd_dep_count#64 = 1))], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(hd_dep_count,3),EqualTo(hd_dep_count,1)),EqualTo(hd_dep_count,1))], ReadSchema: struct<hd_demo_sk:int,hd_dep_count:int>
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org