You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2015/04/02 19:35:52 UTC

[jira] [Created] (HIVE-10194) CBO (Calcite Return Path): Q94 generates cross product

Mostafa Mokhtar created HIVE-10194:
--------------------------------------

             Summary: CBO (Calcite Return Path): Q94 generates cross product 
                 Key: HIVE-10194
                 URL: https://issues.apache.org/jira/browse/HIVE-10194
             Project: Hive
          Issue Type: Sub-task
            Reporter: Mostafa Mokhtar
            Assignee: Jesus Camacho Rodriguez


Query 
{code}
SELECT count(distinct ws_order_number) as order_count,
               sum(ws_ext_ship_cost) as total_shipping_cost,
               sum(ws_net_profit) as total_net_profit
FROM web_sales ws1
JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
                               FROM web_sales ws2 JOIN web_sales ws3
                               ON (ws2.ws_order_number = ws3.ws_order_number)
                               WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk
			) ws_wh1
ON (ws1.ws_order_number = ws_wh1.ws_order_number)
LEFT OUTER JOIN web_returns wr1 ON (ws1.ws_order_number = wr1.wr_order_number)
WHERE d.d_date between '1999-05-01' and '1999-07-01' and
               ca.ca_state = 'TX' and
               s.web_company_name = 'pri' and
               wr1.wr_order_number is null
limit 100
{code}

Plan
{code}
OK
Time taken: 0.23 seconds
Warning: Map Join MAPJOIN[83][bigTable=ws1] in task 'Map 2' is a cross product
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 2 <- Map 1 (BROADCAST_EDGE)
        Map 8 <- Reducer 4 (BROADCAST_EDGE)
        Reducer 3 <- Map 2 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE), Map 7 (SIMPLE_EDGE)
        Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
        Reducer 9 <- Map 8 (SIMPLE_EDGE)
      DagName: mmokhtar_20150402132417_1bc8688b-59a0-4909-82a4-b9d386065bbd:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: ws1
                  filterExpr: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
                  Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk <> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
                    Statistics: Num rows: 71974471 Data size: 1151483592 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ws_ship_addr_sk (type: int)
                      outputColumnNames: _col1
                      Statistics: Num rows: 71974471 Data size: 287862044 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        sort order:
                        Statistics: Num rows: 71974471 Data size: 287862044 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: int)
            Execution mode: vectorized
        Map 10
            Map Operator Tree:
                TableScan
                  alias: wr1
                  Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: wr_order_number (type: int)
                    sort order: +
                    Map-reduce partition columns: wr_order_number (type: int)
                    Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 2
            Map Operator Tree:
                TableScan
                  alias: ws1
                  Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                  Map Join Operator
                    condition map:
                         Inner Join 0 to 1
                    keys:
                      0
                      1
                    outputColumnNames: _col1
                    input vertices:
                      0 Map 1
                    Statistics: Num rows: 5180969438964472 Data size: 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col1 (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 5180969438964472 Data size: 20723877755857888 Basic stats: COMPLETE Column stats: COMPLETE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 73333928460636 Data size: 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          sort order: +
                          Map-reduce partition columns: _col0 (type: int)
                          Statistics: Num rows: 73333928460636 Data size: 293335713842544 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 5
            Map Operator Tree:
                TableScan
                  alias: ca
                  filterExpr: ((ca_state = 'TX') and ca_address_sk is not null) (type: boolean)
                  Statistics: Num rows: 800000 Data size: 811903688 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((ca_state = 'TX') and ca_address_sk is not null) (type: boolean)
                    Statistics: Num rows: 15686 Data size: 1411740 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ca_address_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 6
            Map Operator Tree:
                TableScan
                  alias: d
                  filterExpr: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk is not null) (type: boolean)
                    Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 7
            Map Operator Tree:
                TableScan
                  alias: ws1
                  filterExpr: (((ws_order_number is not null and ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
                  Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (((ws_order_number is not null and ws_ship_addr_sk is not null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
                    Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: ws_order_number (type: int)
                      sort order: +
                      Map-reduce partition columns: ws_order_number (type: int)
                      Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: ws_ship_date_sk (type: int), ws_ship_addr_sk (type: int), ws_web_site_sk (type: int), ws_ext_ship_cost (type: float), ws_net_profit (type: float)
            Execution mode: vectorized
        Map 8
            Map Operator Tree:
                TableScan
                  alias: s
                  filterExpr: ((web_company_name = 'pri') and web_site_sk is not null) (type: boolean)
                  Statistics: Num rows: 38 Data size: 70614 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((web_company_name = 'pri') and web_site_sk is not null) (type: boolean)
                    Statistics: Num rows: 5 Data size: 460 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: web_site_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col30 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col31, _col32, _col33
                        input vertices:
                          0 Reducer 4
                        Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: _col31 (type: int), _col32 (type: float), _col33 (type: float)
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                          Group By Operator
                            aggregations: count(DISTINCT _col0), sum(_col1), sum(_col2)
                            keys: _col0 (type: int)
                            mode: hash
                            outputColumnNames: _col0, _col1, _col2, _col3
                            Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                              TopN Hash Memory Usage: 0.04
                              value expressions: _col2 (type: double), _col3 (type: double)
            Execution mode: vectorized
        Reducer 3
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Left Semi Join 0 to 1
                keys:
                  0 ws_order_number (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col2, _col11, _col13, _col17, _col28, _col33
                Statistics: Num rows: 718857633877870 Data size: 17252583213068880 Basic stats: COMPLETE Column stats: COMPLETE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col11 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col2, _col13, _col17, _col28, _col33
                  input vertices:
                    1 Map 5
                  Statistics: Num rows: 14094999486464 Data size: 281899989729280 Basic stats: COMPLETE Column stats: COMPLETE
                  Map Join Operator
                    condition map:
                         Inner Join 0 to 1
                    keys:
                      0 _col2 (type: int)
                      1 _col0 (type: int)
                    outputColumnNames: _col13, _col17, _col28, _col33
                    input vertices:
                      1 Map 6
                    Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: _col13 (type: int), _col17 (type: int), _col28 (type: float), _col33 (type: float)
                      outputColumnNames: _col2, _col3, _col4, _col5
                      Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col3 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col3 (type: int)
                        Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col2 (type: int), _col4 (type: float), _col5 (type: float)
        Reducer 4
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Right Outer Join0 to 1
                keys:
                  0 wr_order_number (type: int)
                  1 _col3 (type: int)
                outputColumnNames: _col13, _col30, _col31, _col32, _col33
                Statistics: Num rows: 12100482980189 Data size: 242009659603780 Basic stats: COMPLETE Column stats: COMPLETE
                Filter Operator
                  predicate: _col13 is null (type: boolean)
                  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col30 (type: int)
                    sort order: +
                    Map-reduce partition columns: _col30 (type: int)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                    value expressions: _col31 (type: int), _col32 (type: float), _col33 (type: float)
        Reducer 9
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(DISTINCT KEY._col0:0._col0), sum(VALUE._col1), sum(VALUE._col2)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 100
                  Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 100
      Processor Tree:
        ListSink
{code}

Logical plan 
{code}
HiveSort(fetch=[100]): rowcount = 1.0, cumulative cost = {4.594004456323317E8 rows, 5.882012320482085E9 cpu, 9.353802456E12 io}, id = 1080
  HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)], agg#1=[sum($1)], agg#2=[sum($2)]): rowcount = 1.0, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1078
    HiveProject($f0=[$3], $f1=[$4], $f2=[$5]): rowcount = 464.4485504122314, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1076
      HiveFilter(condition=[isnull($12)]): rowcount = 464.4485504122314, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1074
        HiveProject(ws_ship_date_sk=[$1], ws_ship_addr_sk=[$2], ws_web_site_sk=[$3], ws_order_number=[$4], ws_ext_ship_cost=[$5], ws_net_profit=[$6], ca_address_sk=[$7], ca_state=[$8], web_site_sk=[$11], web_company_name=[$12], d_date_sk=[$9], d_date=[$10], wr_order_number=[$0]): rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1185
          HiveJoin(condition=[=($3, $11)], joinType=[inner], joinAlgorithm=[map_join], cost=[{1.3749818235294119E7 rows, 1.3749820470588237E7 cpu, 0.0 io}]): rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1183
            HiveJoin(condition=[=($4, $0)], joinType=[right], joinAlgorithm=[map_join], cost=[{1.3749817E7 rows, 2.7499633E7 cpu, 2.47496688E10 io}]): rowcount = 1.3749816E7, cumulative cost = {4.456506153970376E8 rows, 5.868262499011497E9 cpu, 9.353802456E12 io}, id = 1181
              HiveTableScan(table=[[tpcds_bin_orc_200.web_returns]]): rowcount = 1.3749816E7, cumulative cost = {0}, id = 974
              HiveJoin(condition=[=($0, $8)], joinType=[inner], joinAlgorithm=[map_join], cost=[{204.39703763146147 rows, 205.39703763146147 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.319007983970376E8 rows, 5.840762866011497E9 cpu, 9.3290527872E12 io}, id = 1179
                HiveJoin(condition=[=($1, $6)], joinType=[inner], joinAlgorithm=[map_join], cost=[{1.43966865E8 rows, 1.43966866E8 cpu, 0.0 io}]): rowcount = 203.39703763146147, cumulative cost = {4.31900594E8 rows, 5.840762660614459E9 cpu, 9.3290527872E12 io}, id = 1110
                  SemiJoin(condition=[=($3, $6)], joinType=[inner]): rowcount = 1.43966864E8, cumulative cost = {2.87933729E8 rows, 5.696795794614459E9 cpu, 9.3290527872E12 io}, id = 1058
                    HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                    HiveProject(ws_order_number=[$1]): rowcount = 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1056
                      HiveFilter(condition=[<>($0, $2)]): rowcount = 1.655760644524185E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1054
                        HiveJoin(condition=[=($1, $3)], joinType=[inner], joinAlgorithm=[common_join], cost=[{2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}]): rowcount = 1.6564734127740878E11, cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1103
                          HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                          HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount = 1.43966864E8, cumulative cost = {0}, id = 958
                  HiveFilter(condition=[=($1, 'TX')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1060
                    HiveTableScan(table=[[tpcds_bin_orc_200.customer_address]]): rowcount = 800000.0, cumulative cost = {0}, id = 965
                HiveFilter(condition=[between(false, $1, '1999-05-01', '1999-07-01')]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1068
                  HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): rowcount = 73049.0, cumulative cost = {0}, id = 971
            HiveFilter(condition=[=($1, 'pri')]): rowcount = 2.235294117647059, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1064
              HiveTableScan(table=[[tpcds_bin_orc_200.web_site]]): rowcount = 38.0, cumulative cost = {0}, id = 968
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)