You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/13 20:40:56 UTC

[GitHub] [arrow] andygrove edited a comment on pull request #8748: ARROW-10703: [Rust] [DataFusion] Make join not collect on every right part

andygrove edited a comment on pull request #8748:
URL: https://github.com/apache/arrow/pull/8748#issuecomment-744065261


   I am also seeing a slowdown.
   Master:
   ```
   Running benchmarks with the following options: BenchmarkOpt { query: 12, debug: true, iterations: 1, concurrency: 12, batch_size: 4096, path: "/mnt/tpch/tbl-sf1/", file_format: "tbl", mem_table: false }
   Logical plan:
   Sort: #l_shipmode ASC NULLS FIRST
     Aggregate: groupBy=[[#l_shipmode]], aggr=[[SUM(CASE WHEN #o_orderpriority Eq Utf8("1-URGENT") Or #o_orderpriority Eq Utf8("2-HIGH") THEN Int64(1) ELSE Int64(0) END) AS high_line_count, SUM(CASE WHEN #o_orderpriority NotEq Utf8("1-URGENT") And #o_orderpriority NotEq Utf8("2-HIGH") THEN Int64(1) ELSE Int64(0) END) AS low_line_count]]
       Filter: #l_shipmode Eq Utf8("MAIL") Or #l_shipmode Eq Utf8("SHIP") And #l_commitdate Lt #l_receiptdate And #l_shipdate Lt #l_commitdate And #l_receiptdate GtEq Utf8("1994-01-01") And #l_receiptdate Lt Utf8("1995-01-01")
         Join: l_orderkey = o_orderkey
           TableScan: lineitem projection=None
           TableScan: orders projection=None
   Optimized logical plan:
   Sort: #l_shipmode ASC NULLS FIRST
     Aggregate: groupBy=[[#l_shipmode]], aggr=[[SUM(CASE WHEN #o_orderpriority Eq Utf8("1-URGENT") Or #o_orderpriority Eq Utf8("2-HIGH") THEN Int64(1) ELSE Int64(0) END) AS high_line_count, SUM(CASE WHEN #o_orderpriority NotEq Utf8("1-URGENT") And #o_orderpriority NotEq Utf8("2-HIGH") THEN Int64(1) ELSE Int64(0) END) AS low_line_count]]
       Join: l_orderkey = o_orderkey
         Filter: #l_shipmode Eq Utf8("MAIL") Or #l_shipmode Eq Utf8("SHIP") And #l_commitdate Lt #l_receiptdate And #l_shipdate Lt #l_commitdate And #l_receiptdate GtEq Utf8("1994-01-01") And #l_receiptdate Lt Utf8("1995-01-01")
           TableScan: lineitem projection=Some([0, 10, 11, 12, 14])
         TableScan: orders projection=Some([0, 5])
   +------------+-----------------+----------------+
   | l_shipmode | high_line_count | low_line_count |
   +------------+-----------------+----------------+
   | MAIL       | 6202            | 9324           |
   | SHIP       | 6200            | 9262           |
   +------------+-----------------+----------------+
   Query 12 iteration 0 took 6708 ms
   ```
   
   This PR
   
   ```
   Running benchmarks with the following options: BenchmarkOpt { query: 12, debug: true, iterations: 1, concurrency: 12, batch_size: 4096, path: "/mnt/tpch/tbl-sf1/", file_format: "tbl", mem_table: false }
   Logical plan:
   Aggregate: groupBy=[[#l_shipmode]], aggr=[[SUM(Int32(1)) AS high_line_count, SUM(Int32(0)) AS low_line_count]]
     Join: l_orderkey = o_orderkey
       Filter: #l_receiptdate Lt Utf8("1995-01-01")
         Filter: #l_receiptdate GtEq Utf8("1994-01-01")
           Filter: #l_shipdate Lt #l_commitdate
             Filter: #l_commitdate Lt #l_receiptdate
               Filter: #l_shipmode Eq Utf8("MAIL") Or #l_shipmode Eq Utf8("SHIP")
                 TableScan: lineitem projection=None
       TableScan: orders projection=None
   Optimized logical plan:
   Aggregate: groupBy=[[#l_shipmode]], aggr=[[SUM(Int32(1)) AS high_line_count, SUM(Int32(0)) AS low_line_count]]
     Join: l_orderkey = o_orderkey
       TableScan: lineitem projection=Some([0, 10, 11, 12, 14])
       TableScan: orders projection=Some([0])
   a774ae7f3b83dd3127cf40709f9ac9d5c8d98e25+------------+-----------------+----------------+
   | l_shipmode | high_line_count | low_line_count |
   +------------+-----------------+----------------+
   | MAIL       | 857399          | 0              |
   | TRUCK      | 856997          | 0              |
   | SHIP       | 858036          | 0              |
   | REG AIR    | 856867          | 0              |
   | AIR        | 858103          | 0              |
   | FOB        | 857323          | 0              |
   | RAIL       | 856484          | 0              |
   +------------+-----------------+----------------+
   Query 12 iteration 0 took 277062 ms
   ```
   
   Note that the results are incorrect as well. It looks the filters are not being applied correctly.
   
   Edit: The filters are being dropped from the optimized plan.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org