You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2014/10/31 04:33:35 UTC
[jira] [Commented] (HIVE-8671) Overflow in estimate row count and
data size with fetch column stats
[ https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191274#comment-14191274 ]
Mostafa Mokhtar commented on HIVE-8671:
---------------------------------------
This is where the bug is :
Since we hit an overflow before data size is set to Log.MAX_VALUE, then when we add 1 to that it overflows and reducers ends up being 1
{code}
public static int estimateReducers(long totalInputFileSize, long bytesPerReducer,
int maxReducers, boolean powersOfTwo) {
int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
reducers = Math.max(1, reducers);
reducers = Math.min(maxReducers, reducers);
{code}
> Overflow in estimate row count and data size with fetch column stats
> --------------------------------------------------------------------
>
> Key: HIVE-8671
> URL: https://issues.apache.org/jira/browse/HIVE-8671
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 0.14.0
> Reporter: Mostafa Mokhtar
> Assignee: Prasanth J
> Priority: Critical
> Fix For: 0.14.0
>
>
> Overflow in row counts and data size for several TPC-DS queries.
> Interestingly the operators which have overflow end up running with a small parallelism.
> For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
> {code}
> Reducer 2
> Reduce Operator Tree:
> Group By Operator
> aggregations: sum(VALUE._col0)
> keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
> mode: mergepartial
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
> Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col3 (type: string), _col3 (type: string)
> sort order: ++
> Map-reduce partition columns: _col3 (type: string)
> Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double)
> Execution mode: vectorized
> {code}
> {code}
> VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS
> Map 1 62 26.41 1,779,510 211,978,502 60,628,390
> Map 5 1 4.28 6,950 138,098 138,098
> Map 6 1 2.44 3,910 31 31
> Reducer 2 2 22.69 61,320 60,628,390 69,182
> Reducer 3 1 2.63 3,910 69,182 100
> Reducer 4 1 1.01 1,180 100 100
> {code}
> Query
> {code}
> explain
> select i_item_desc
> ,i_category
> ,i_class
> ,i_current_price
> ,i_item_id
> ,sum(ws_ext_sales_price) as itemrevenue
> ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
> (partition by i_class) as revenueratio
> from
> web_sales
> ,item
> ,date_dim
> where
> web_sales.ws_item_sk = item.i_item_sk
> and item.i_category in ('Jewelry', 'Sports', 'Books')
> and web_sales.ws_sold_date_sk = date_dim.d_date_sk
> and date_dim.d_date between '2001-01-12' and '2001-02-11'
> group by
> i_item_id
> ,i_item_desc
> ,i_category
> ,i_class
> ,i_current_price
> order by
> i_category
> ,i_class
> ,i_item_id
> ,i_item_desc
> ,revenueratio
> limit 100
> {code}
> Explain
> {code}
> STAGE PLANS:
> Stage: Stage-1
> Tez
> Edges:
> Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
> DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: web_sales
> filterExpr: ws_item_sk is not null (type: boolean)
> Statistics: Num rows: 21594638446 Data size: 2850189889652 Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: ws_item_sk is not null (type: boolean)
> Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: ws_item_sk (type: int), ws_ext_sales_price (type: float), ws_sold_date_sk (type: int)
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col0} {_col1}
> 1
> keys:
> 0 _col2 (type: int)
> 1 _col0 (type: int)
> outputColumnNames: _col0, _col1
> input vertices:
> 1 Map 6
> Statistics: Num rows: 24145061366 Data size: 193160490928 Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col1}
> 1 {_col1} {_col2} {_col3} {_col4} {_col5}
> keys:
> 0 _col0 (type: int)
> 1 _col0 (type: int)
> outputColumnNames: _col1, _col6, _col7, _col8, _col9, _col10
> input vertices:
> 1 Map 5
> Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col6 (type: string), _col7 (type: string), _col10 (type: string), _col9 (type: string), _col8 (type: float), _col1 (type: float)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
> Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: sum(_col5)
> keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
> Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float)
> sort order: +++++
> Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float)
> Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE
> value expressions: _col5 (type: double)
> Execution mode: vectorized
> Map 5
> Map Operator Tree:
> TableScan
> alias: item
> filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean)
> Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean)
> Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float), i_class (type: string), i_category (type: string)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
> Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE
> value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float), _col4 (type: string), _col5 (type: string)
> Execution mode: vectorized
> Map 6
> Map Operator Tree:
> TableScan
> alias: date_dim
> filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean)
> Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean)
> Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: d_date_sk (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
> Group By Operator
> keys: _col0 (type: int)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
> Dynamic Partitioning Event Operator
> Target Input: web_sales
> Partition key expr: ws_sold_date_sk
> Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
> Target column: ws_sold_date_sk
> Target Vertex: Map 1
> Execution mode: vectorized
> Reducer 2
> Reduce Operator Tree:
> Group By Operator
> aggregations: sum(VALUE._col0)
> keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
> mode: mergepartial
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
> Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col3 (type: string), _col3 (type: string)
> sort order: ++
> Map-reduce partition columns: _col3 (type: string)
> Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double)
> Execution mode: vectorized
> Reducer 3
> Reduce Operator Tree:
> Extract
> Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE
> PTF Operator
> Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: double), ((_col5 * 100.0) / _wcol0) (type: double)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
> Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col1 (type: string), _col2 (type: string), _col4 (type: string), _col0 (type: string), _col6 (type: double)
> sort order: +++++
> Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE
> TopN Hash Memory Usage: 0.04
> value expressions: _col3 (type: float), _col5 (type: double)
> Reducer 4
> Reduce Operator Tree:
> Select Operator
> expressions: KEY.reducesinkkey3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 (type: double), KEY.reducesinkkey4 (type: double)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
> Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE
> Limit
> Number of rows: 100
> Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Execution mode: vectorized
> Stage: Stage-0
> Fetch Operator
> limit: 100
> Processor Tree:
> ListSink
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)