You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/10 23:53:30 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue, #3789: Optimize benchmark q2 subquery filter

andygrove opened a new issue, #3789:
URL: https://github.com/apache/arrow-datafusion/issues/3789

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   @Dandandan noticed a difference between Spark and DataFusion logical plan for q2:
   
   Instead of moving `partsupp.ps_supplycost = __sq_1.__value` to a filter, Spark adds it to the inner join instead.
   
   Seems in the spark query this one is included in Inner join, instead of a filter:
   
   ```
   ((knownfloatingpointnormalized(normalizenanandzero(ps_supplycost#35)) = knownfloatingpointnormalized(normalizenanandzero(min(ps_supplycost)#124)))
   ```
   
   That will likely reduce the output size of the join.
   
   ## DataFusion
   
   ```
   Sort: supplier.s_acctbal DESC NULLS FIRST, nation.n_name ASC NULLS LAST, supplier.s_name ASC NULLS LAST, part.p_partkey ASC NULLS LAST
     Projection: supplier.s_acctbal, supplier.s_name, nation.n_name, part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone, supplier.s_comment
       Filter: partsupp.ps_supplycost = __sq_1.__value
         Inner Join: part.p_partkey = __sq_1.ps_partkey
           Inner Join: nation.n_regionkey = region.r_regionkey
             Inner Join: supplier.s_nationkey = nation.n_nationkey
               Inner Join: partsupp.ps_suppkey = supplier.s_suppkey
                 Inner Join: part.p_partkey = partsupp.ps_partkey
                   Filter: part.p_size = Int32(15) AND part.p_type LIKE Utf8("%BRASS")
                     TableScan: part projection=[p_partkey, p_mfgr, p_type, p_size], partial_filters=[part.p_size = Int32(15), part.p_type LIKE Utf8("%BRASS")]
                   TableScan: partsupp projection=[ps_partkey, ps_suppkey, ps_supplycost]
                 TableScan: supplier projection=[s_suppkey, s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment]
               TableScan: nation projection=[n_nationkey, n_name, n_regionkey]
             Filter: region.r_name = Utf8("EUROPE")
               TableScan: region projection=[r_regionkey, r_name], partial_filters=[region.r_name = Utf8("EUROPE")]
           Projection: partsupp.ps_partkey, MIN(partsupp.ps_supplycost) AS __value, alias=__sq_1
             Aggregate: groupBy=[[partsupp.ps_partkey]], aggr=[[MIN(partsupp.ps_supplycost)]]
               Inner Join: nation.n_regionkey = region.r_regionkey
                 Inner Join: supplier.s_nationkey = nation.n_nationkey
                   Inner Join: partsupp.ps_suppkey = supplier.s_suppkey
                     TableScan: partsupp projection=[ps_partkey, ps_suppkey, ps_supplycost]
                     TableScan: supplier projection=[s_suppkey, s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment]
                   TableScan: nation projection=[n_nationkey, n_name, n_regionkey]
                 Filter: region.r_name = Utf8("EUROPE")
                   TableScan: region projection=[r_regionkey, r_name], partial_filters=[region.r_name = Utf8("EUROPE")]
   ```
   
   ## Spark
   
   ```
   Sort [s_acctbal#23 DESC NULLS LAST, n_name#109 ASC NULLS FIRST, s_name#19 ASC NULLS FIRST, p_partkey#0L ASC NULLS FIRST], true
   +- Project [s_acctbal#23, s_name#19, n_name#109, p_partkey#0L, p_mfgr#2, s_address#20, s_phone#22, s_comment#24]
      +- Join Inner, (n_regionkey#110L = r_regionkey#116L)
         :- Project [p_partkey#0L, p_mfgr#2, s_name#19, s_address#20, s_phone#22, s_acctbal#23, s_comment#24, n_name#109, n_regionkey#110L]
         :  +- Join Inner, (s_nationkey#21L = n_nationkey#108L)
         :     :- Project [p_partkey#0L, p_mfgr#2, s_name#19, s_address#20, s_nationkey#21L, s_phone#22, s_acctbal#23, s_comment#24]
         :     :  +- Join Inner, (s_suppkey#18L = ps_suppkey#33L)
         :     :     :- Project [p_partkey#0L, p_mfgr#2, ps_suppkey#33L]
         :     :     :  +- Join Inner, ((ps_supplycost#35 = min(ps_supplycost)#124) AND (p_partkey#0L = ps_partkey#125L))
         :     :     :     :- Project [p_partkey#0L, p_mfgr#2, ps_suppkey#33L, ps_supplycost#35]
         :     :     :     :  +- Join Inner, (p_partkey#0L = ps_partkey#32L)
         :     :     :     :     :- Project [p_partkey#0L, p_mfgr#2]
         :     :     :     :     :  +- Filter (((isnotnull(p_size#5) AND isnotnull(p_type#4)) AND ((p_size#5 = 15) AND EndsWith(p_type#4, BRASS))) AND isnotnull(p_partkey#0L))
         :     :     :     :     :     +- Relation [p_partkey#0L,p_name#1,p_mfgr#2,p_brand#3,p_type#4,p_size#5,p_container#6,p_retailprice#7,p_comment#8] parquet
         :     :     :     :     +- Project [ps_partkey#32L, ps_suppkey#33L, ps_supplycost#35]
         :     :     :     :        +- Filter (isnotnull(ps_partkey#32L) AND (isnotnull(ps_supplycost#35) AND isnotnull(ps_suppkey#33L)))
         :     :     :     :           +- Relation [ps_partkey#32L,ps_suppkey#33L,ps_availqty#34,ps_supplycost#35,ps_comment#36] parquet
         :     :     :     +- Filter isnotnull(min(ps_supplycost)#124)
         :     :     :        +- Aggregate [ps_partkey#125L], [min(ps_supplycost#128) AS min(ps_supplycost)#124, ps_partkey#125L]
         :     :     :           +- Project [ps_partkey#125L, ps_supplycost#128]
         :     :     :              +- Join Inner, (n_regionkey#139L = r_regionkey#141L)
         :     :     :                 :- Project [ps_partkey#125L, ps_supplycost#128, n_regionkey#139L]
         :     :     :                 :  +- Join Inner, (s_nationkey#133L = n_nationkey#137L)
         :     :     :                 :     :- Project [ps_partkey#125L, ps_supplycost#128, s_nationkey#133L]
         :     :     :                 :     :  +- Join Inner, (s_suppkey#130L = ps_suppkey#126L)
         :     :     :                 :     :     :- Project [ps_partkey#125L, ps_suppkey#126L, ps_supplycost#128]
         :     :     :                 :     :     :  +- Filter (isnotnull(ps_suppkey#126L) AND isnotnull(ps_partkey#125L))
         :     :     :                 :     :     :     +- Relation [ps_partkey#125L,ps_suppkey#126L,ps_availqty#127,ps_supplycost#128,ps_comment#129] parquet
         :     :     :                 :     :     +- Project [s_suppkey#130L, s_nationkey#133L]
         :     :     :                 :     :        +- Filter (isnotnull(s_suppkey#130L) AND isnotnull(s_nationkey#133L))
         :     :     :                 :     :           +- Relation [s_suppkey#130L,s_name#131,s_address#132,s_nationkey#133L,s_phone#134,s_acctbal#135,s_comment#136] parquet
         :     :     :                 :     +- Project [n_nationkey#137L, n_regionkey#139L]
         :     :     :                 :        +- Filter (isnotnull(n_nationkey#137L) AND isnotnull(n_regionkey#139L))
         :     :     :                 :           +- Relation [n_nationkey#137L,n_name#138,n_regionkey#139L,n_comment#140] parquet
         :     :     :                 +- Project [r_regionkey#141L]
         :     :     :                    +- Filter ((isnotnull(r_name#142) AND (r_name#142 = EUROPE)) AND isnotnull(r_regionkey#141L))
         :     :     :                       +- Relation [r_regionkey#141L,r_name#142,r_comment#143] parquet
         :     :     +- Filter (isnotnull(s_suppkey#18L) AND isnotnull(s_nationkey#21L))
         :     :        +- Relation [s_suppkey#18L,s_name#19,s_address#20,s_nationkey#21L,s_phone#22,s_acctbal#23,s_comment#24] parquet
         :     +- Project [n_nationkey#108L, n_name#109, n_regionkey#110L]
         :        +- Filter (isnotnull(n_nationkey#108L) AND isnotnull(n_regionkey#110L))
         :           +- Relation [n_nationkey#108L,n_name#109,n_regionkey#110L,n_comment#111] parquet
         +- Project [r_regionkey#116L]
            +- Filter ((isnotnull(r_name#117) AND (r_name#117 = EUROPE)) AND isnotnull(r_regionkey#116L))
               +- Relation [r_regionkey#116L,r_name#117,r_comment#118] parquet
   ```
   
   **Describe the solution you'd like**
   Hopefully we can refine our existing rule or add a new rule to optimize this.
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan closed issue #3789: Optimize benchmark q2 subquery filter

Posted by GitBox <gi...@apache.org>.
Dandandan closed issue #3789: Optimize benchmark q2 subquery filter
URL: https://github.com/apache/arrow-datafusion/issues/3789


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org