You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2017/06/12 21:52:00 UTC
[jira] [Created] (HIVE-16887) Parquet rawDataSize Under Reported
BELUGA BEHR created HIVE-16887:
----------------------------------
Summary: Parquet rawDataSize Under Reported
Key: HIVE-16887
URL: https://issues.apache.org/jira/browse/HIVE-16887
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Query Planning
Affects Versions: 2.1.1
Reporter: BELUGA BEHR
Priority: Minor
{code:sql}
CREATE TABLE `test_stats`(
`a` int,
`b` int,
`c` string,
`d` bigint) stored as parquet;
INSERT INTO test_stats VALUES (1,31,"This is a test", 4567890);
ANALYZE TABLE test_stats COMPUTE STATISTICS FOR COLUMNS;
EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;
{code}
{code}
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test_stats
filterExpr: c is null (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: c is null (type: boolean)
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: a (type: int), b (type: int), null (type: string), d (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
For Parquet tables, when Hive looks at Table stats, it sees one size for the table, when it looks at column stats, it sees another:
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
The row stats are much more accurate though I would expect them to be the same.
Total Table Size = (Num Rows) * (Average Row Size)
The rawDataSize is reported as 4 bytes. This is way under reporting the size of the data. Perhaps it is 4 bytes in Parquet format, but when this data loads into a Spark HashTable Sink or a Cache, this one row is going to be at least (4+4+8+14) 30 bytes.
In cases where we set {{hive.auto.convert.join.noconditionaltask.size}} to be 10mb and it is based off of the rawDataSize, in Spark when stats are enabled, we will require 114 bytes instead of the 4 bytes as reported in table stats (28.5x). You can imagine a case where the rawDataSize is reported as 10MB but the real amount of required memory to cache is 285MB! This may break an executor with limited memory or where multiple tables are being cached.
The Parquet SerDe should be reporting the total size of the data (uncompressed) so that we know exactly how much data we will be loading into memory, much like they do it here:
https://github.com/apache/hive/blob/6b6a00ffb0dae651ef407a99bab00d5e74f0d6aa/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L544
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)