Posted to dev@hive.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2016/11/03 23:19:58 UTC
[jira] [Created] (HIVE-15122) Hive: Upcasting types should not obscure stats (min/max/ndv)
Siddharth Seth created HIVE-15122:
-------------------------------------
Summary: Hive: Upcasting types should not obscure stats (min/max/ndv)
Key: HIVE-15122
URL: https://issues.apache.org/jira/browse/HIVE-15122
Project: Hive
Issue Type: Bug
Reporter: Siddharth Seth
A UDFToLong cast that upcasts an int join key to bigint breaks PK/FK inference and triggers mis-estimation of joins in LLAP.
Snippet from the bad plan.
{code}
| STAGE PLANS: |
| Stage: Stage-1 |
| Tez |
| DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6 |
| Edges: |
| Map 2 <- Map 1 (BROADCAST_EDGE) |
| Map 3 <- Map 2 (BROADCAST_EDGE) |
| Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE) |
| Reducer 5 <- Reducer 4 (SIMPLE_EDGE) |
| Reducer 6 <- Reducer 5 (SIMPLE_EDGE) |
| DagName: |
| Vertices: |
| Map 1 |
| Map Operator Tree: |
| TableScan |
| alias: supplier |
| filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean) |
| Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
| Filter Operator |
| predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean) |
| Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: s_suppkey (type: bigint), s_nationkey (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| Execution mode: vectorized, llap |
| LLAP IO: all inputs |
| Map 2 |
| Map Operator Tree: |
| TableScan |
| alias: lineitem |
| filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean) |
| Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL |
| Filter Operator |
| predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean) |
| Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL |
| Select Operator |
| expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col0 (type: bigint) |
| 1 UDFToLong(_col1) (type: bigint) |
| outputColumnNames: _col1, _col2, _col4, _col5, _col6 |
| input vertices: |
| 0 Map 1 |
| Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL |
| Reduce Output Operator |
| key expressions: _col2 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col2 (type: bigint) |
| Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL |
| value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date) |
| Execution mode: vectorized, llap |
| LLAP IO: all inputs |
| Map 3 |
| Map Operator Tree: |
| TableScan |
| alias: orders |
| filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean) |
| Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean) |
| Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: o_orderkey (type: int), o_custkey (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col2 (type: bigint) |
| 1 UDFToLong(_col0) (type: bigint) |
| outputColumnNames: _col1, _col4, _col5, _col6, _col8 |
| input vertices: |
| 0 Map 2 |
| Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col8 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col8 (type: bigint) |
| Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date) |
| Execution mode: vectorized, llap |
| LLAP IO: all inputs |
| Map 7
{code}
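Note the Map 2 to Map 3 output: the map join in Map 2 reads 2285121364 lineitem rows, joins on UDFToLong(_col1), and is estimated to emit only 10000000 rows (880000000 bytes), and that output reaches Map 3 over a BROADCAST_EDGE.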
This causes a rather large join (120GB) to be categorized as a map-join.
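
To illustrate the behaviour the summary asks for, here is a minimal sketch of carrying min/max/ndv through a widening integer cast. It is not Hive's code: ColStats, INT_WIDTH and cast_stats are hypothetical names, and the sample numbers are made up rather than taken from the plan above.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for per-column statistics; not Hive's actual class.
@dataclass(frozen=True)
class ColStats:
    type_name: str
    min_value: int
    max_value: int
    ndv: int  # number of distinct values


# Integer widths in bits, used to decide whether a cast is a pure widening.
INT_WIDTH = {"tinyint": 8, "smallint": 16, "int": 32, "bigint": 64}


def cast_stats(stats: Optional[ColStats], target: str) -> Optional[ColStats]:
    """Propagate stats through an integer cast, or return None if unsafe."""
    if stats is None:
        return None
    src = INT_WIDTH.get(stats.type_name)
    dst = INT_WIDTH.get(target)
    if src is None or dst is None or dst < src:
        # Narrowing or non-integer casts may change values; drop the stats.
        return None
    # Widening is value-preserving: min, max and ndv carry over unchanged.
    return ColStats(target, stats.min_value, stats.max_value, stats.ndv)


# Illustrative numbers only; not taken from the plan.
l_suppkey = ColStats("int", 1, 10_000_000, 10_000_000)
widened = cast_stats(l_suppkey, "bigint")
```

With stats preserved this way, a join key such as UDFToLong(_col1) would still expose the ndv and range of the underlying int column to whatever estimator consumes them; how Hive should wire this in is left open here.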