You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2015/02/18 04:47:11 UTC

[jira] [Created] (HIVE-9714) Physical optimizer : Under estimation for left join outer condition

Mostafa Mokhtar created HIVE-9714:
-------------------------------------

             Summary: Physical optimizer : Under estimation for left join outer condition 
                 Key: HIVE-9714
                 URL: https://issues.apache.org/jira/browse/HIVE-9714
             Project: Hive
          Issue Type: Bug
          Components: Physical Optimizer
    Affects Versions: 0.14.0
            Reporter: Mostafa Mokhtar
            Assignee: Prasanth Jayachandran


For the query below which is a subset of TPC-DS Query 80, CBO joins catalog_sales with catalog_returns first although the CE of the join is relatively high.
The physical optimizer estimates for catalog_sales left out join catalog_returns 2,911 rows while the actual is 285,470,584 rows.

Database used was un-partitioned.

{code}
select count(*)
from
          catalog_sales
         ,warehouse
         ,date_dim
         ,time_dim
 	 ,ship_mode
     where
            catalog_sales.cs_warehouse_sk =  warehouse.w_warehouse_sk
        and catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
        and catalog_sales.cs_sold_time_sk = time_dim.t_time_sk
 	and catalog_sales.cs_ship_mode_sk = ship_mode.sm_ship_mode_sk
        and d_year = 2002
 	and t_time between 49530 AND 49530+28800 
 	and sm_carrier in ('DIAMOND','AIRBORNE')
     group by 
        w_warehouse_name
 	,w_warehouse_sq_ft
 	,w_city
 	,w_county
 	,w_state
 	,w_country
       ,d_year
{code}

Explain plan 
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 2 (BROADCAST_EDGE)
        Map 3 <- Map 1 (BROADCAST_EDGE)
        Map 4 <- Map 3 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE), Map 7 (BROADCAST_EDGE)
        Reducer 5 <- Map 4 (SIMPLE_EDGE)
      DagName: mmokhtar_20150217223434_d0ab6fa9-a1a3-47a5-8138-ba7435d9aea5:4
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: catalog_sales
                  filterExpr: (((cs_sold_date_sk is not null and cs_catalog_page_sk is not null) and cs_item_sk is not null) and cs_promo_sk is not null) (type: boolean)
                  Statistics: Num rows: 286549727 Data size: 65825832570 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (((cs_sold_date_sk is not null and cs_catalog_page_sk is not null) and cs_item_sk is not null) and cs_promo_sk is not null) (type: boolean)
                    Statistics: Num rows: 285112475 Data size: 7974560516 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: cs_sold_date_sk (type: int), cs_catalog_page_sk (type: int), cs_item_sk (type: int), cs_promo_sk (type: int), cs_order_number (type: int), cs_ext_sales_price (type: float), cs_net_profit (type: float)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                      Statistics: Num rows: 285112475 Data size: 7974560516 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Left Outer Join0 to 1
                        keys:
                          0 _col2 (type: int), _col4 (type: int)
                          1 _col0 (type: int), _col1 (type: int)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col5, _col6, _col9, _col10
                        input vertices:
                          1 Map 2
                        Statistics: Num rows: 2911 Data size: 93152 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          sort order: +
                          Map-reduce partition columns: _col0 (type: int)
                          Statistics: Num rows: 2911 Data size: 93152 Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col1 (type: int), _col2 (type: int), _col3 (type: int), _col5 (type: float), _col6 (type: float), _col9 (type: float), _col10 (type: float)
            Execution mode: vectorized
        Map 2
            Map Operator Tree:
                TableScan
                  alias: catalog_returns
                  filterExpr: cr_item_sk is not null (type: boolean)
                  Statistics: Num rows: 28798881 Data size: 5764329494 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: cr_item_sk is not null (type: boolean)
                    Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: cr_item_sk (type: int), cr_order_number (type: int), cr_return_amount (type: float), cr_net_loss (type: float)
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int), _col1 (type: int)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
                        Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col2 (type: float), _col3 (type: float)
            Execution mode: vectorized
        Map 3
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date BETWEEN 1998-08-04 AND 1998-09-04 and d_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_date BETWEEN 1998-08-04 AND 1998-09-04 and d_date_sk is not null) (type: boolean)
                    Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col1, _col2, _col3, _col5, _col6, _col9, _col10
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 1456 Data size: 40768 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col1 (type: int)
                          sort order: +
                          Map-reduce partition columns: _col1 (type: int)
                          Statistics: Num rows: 1456 Data size: 40768 Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col2 (type: int), _col3 (type: int), _col5 (type: float), _col6 (type: float), _col9 (type: float), _col10 (type: float)
            Execution mode: vectorized
        Map 4
            Map Operator Tree:
                TableScan
                  alias: catalog_page
                  filterExpr: cp_catalog_page_sk is not null (type: boolean)
                  Statistics: Num rows: 11718 Data size: 5400282 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: cp_catalog_page_sk is not null (type: boolean)
                    Statistics: Num rows: 11718 Data size: 1218672 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: cp_catalog_page_sk (type: int), cp_catalog_page_id (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 11718 Data size: 1218672 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col1 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col2, _col3, _col5, _col6, _col9, _col10, _col14
                        input vertices:
                          0 Map 3
                        Statistics: Num rows: 1456 Data size: 180544 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col2 (type: int)
                            1 _col0 (type: int)
                          outputColumnNames: _col3, _col5, _col6, _col9, _col10, _col14
                          input vertices:
                            1 Map 6
                          Statistics: Num rows: 486 Data size: 58320 Basic stats: COMPLETE Column stats: COMPLETE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            keys:
                              0 _col3 (type: int)
                              1 _col0 (type: int)
                            outputColumnNames: _col5, _col6, _col9, _col10, _col14
                            input vertices:
                              1 Map 7
                            Statistics: Num rows: 243 Data size: 28188 Basic stats: COMPLETE Column stats: COMPLETE
                            Select Operator
                              expressions: _col14 (type: string), _col5 (type: float), COALESCE(_col9,0) (type: float), (_col6 - COALESCE(_col10,0)) (type: float)
                              outputColumnNames: _col0, _col1, _col2, _col3
                              Statistics: Num rows: 243 Data size: 28188 Basic stats: COMPLETE Column stats: COMPLETE
                              Group By Operator
                                aggregations: sum(_col1), sum(_col2), sum(_col3)
                                keys: _col0 (type: string)
                                mode: hash
                                outputColumnNames: _col0, _col1, _col2, _col3
                                Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                                Reduce Output Operator
                                  key expressions: _col0 (type: string)
                                  sort order: +
                                  Map-reduce partition columns: _col0 (type: string)
                                  Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                                  value expressions: _col1 (type: double), _col2 (type: double), _col3 (type: double)
            Execution mode: vectorized
        Map 6
            Map Operator Tree:
                TableScan
                  alias: item
                  filterExpr: ((i_current_price > 50.0) and i_item_sk is not null) (type: boolean)
                  Statistics: Num rows: 48000 Data size: 68732712 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((i_current_price > 50.0) and i_item_sk is not null) (type: boolean)
                    Statistics: Num rows: 16000 Data size: 127832 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: i_item_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 16000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 16000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 7
            Map Operator Tree:
                TableScan
                  alias: promotion
                  filterExpr: ((p_channel_tv = 'N') and p_promo_sk is not null) (type: boolean)
                  Statistics: Num rows: 450 Data size: 530848 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((p_channel_tv = 'N') and p_promo_sk is not null) (type: boolean)
                    Statistics: Num rows: 225 Data size: 20025 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: p_promo_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 225 Data size: 900 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 225 Data size: 900 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Reducer 5
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0), sum(VALUE._col1), sum(VALUE._col2)
                keys: KEY._col0 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)