Posted to dev@hive.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2017/07/27 08:32:00 UTC
[jira] [Created] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file
liyunzhang_intel created HIVE-17182:
---------------------------------------
Summary: Invalid statistics like "RAW DATA SIZE" info for parquet file
Key: HIVE-17182
URL: https://issues.apache.org/jira/browse/HIVE-17182
Project: Hive
Issue Type: Bug
Reporter: liyunzhang_intel
On the TPC-DS 200G scale store_sales table,
run "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name data_type comment
ss_sold_time_sk bigint
ss_item_sk bigint
ss_customer_sk bigint
ss_cdemo_sk bigint
ss_hdemo_sk bigint
ss_addr_sk bigint
ss_store_sk bigint
ss_promo_sk bigint
ss_ticket_number bigint
ss_quantity int
ss_wholesale_cost double
ss_list_price double
ss_sales_price double
ss_ext_discount_amt double
ss_ext_sales_price double
ss_ext_wholesale_cost double
ss_ext_list_price double
ss_ext_tax double
ss_coupon_amt double
ss_net_paid double
ss_net_paid_inc_tax double
ss_net_profit double
# Partition Information
# col_name data_type comment
ss_sold_date_sk bigint
# Detailed Table Information
Database: tpcds_bin_partitioned_parquet_200
Owner: root
CreateTime: Tue Jun 06 11:51:48 CST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 2023
numPartitions 1824
numRows 575995635
rawDataSize 12671903970
totalSize 46465926745
transient_lastDdlTime 1496721108
{code}
The rawDataSize is about 12G, while the totalSize is about 46G.
View the original data on HDFS:
{format}
# hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G /tmp/tpcds-generate/200/store_sales
{format}
View the parquet files on HDFS:
{format}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{format}
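The 43.3 G printed by "hadoop fs -du -h" and the totalSize of 46465926745 bytes appear to describe the same data in different units, since "-du -h" uses 1024-based units. A minimal Python sketch of the conversion follows; the constant and function names are illustrative only and do not come from Hive or Hadoop.

```python
# Values copied from the "Table Parameters" output in this report.
TOTAL_SIZE_BYTES = 46465926745
RAW_DATA_SIZE_BYTES = 12671903970

GB = 1000 ** 3   # decimal gigabyte
GIB = 1024 ** 3  # binary gigabyte, the unit behind "hadoop fs -du -h"


def in_units(num_bytes, unit):
    """Return num_bytes expressed in the given unit, rounded to one decimal."""
    return round(num_bytes / unit, 1)


total_gb = in_units(TOTAL_SIZE_BYTES, GB)
total_gib = in_units(TOTAL_SIZE_BYTES, GIB)
raw_gb = in_units(RAW_DATA_SIZE_BYTES, GB)
raw_gib = in_units(RAW_DATA_SIZE_BYTES, GIB)

print(f"totalSize:   {total_gb} GB = {total_gib} GiB")
print(f"rawDataSize: {raw_gb} GB = {raw_gib} GiB")
```

The totalSize comes out at 43.3 in 1024-based units, which matches the "-du -h" figure, so totalSize looks consistent with what is on disk and only rawDataSize is in question.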
It seems that the rawDataSize should be nearly 75G (the size of the original data), but the "describe formatted store_sales" command shows only about 12G.
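As a quick sanity check on the reported numbers, the sketch below divides rawDataSize and totalSize by numRows. The variable names are illustrative, and the column count of 22 is simply the number of non-partition columns listed in the "describe formatted" output above. Any interpretation of the result is an inference from the arithmetic, not something confirmed in this report.

```python
# Values copied from the "Table Parameters" output in this report.
num_rows = 575995635
raw_data_size = 12671903970
total_size = 46465926745

# Non-partition columns listed by "describe formatted store_sales"
# (ss_sold_time_sk through ss_net_profit).
num_columns = 22

# How much "raw data" is recorded per row, and whether it divides evenly.
raw_per_row, remainder = divmod(raw_data_size, num_rows)

# On-disk (compressed parquet) bytes per row, for comparison.
total_per_row = total_size / num_rows

print(f"rawDataSize / numRows = {raw_per_row} (remainder {remainder})")
print(f"totalSize / numRows   = {total_per_row:.2f}")
print(f"rawDataSize == numRows * num_columns: "
      f"{raw_data_size == num_rows * num_columns}")
```

rawDataSize divides evenly by numRows and the quotient equals the column count, while the parquet files hold roughly 80 bytes per row on disk. This may indicate that the recorded rawDataSize is counting fields rather than bytes, which would explain why it is smaller than both totalSize and the original data.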