Posted to dev@hive.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2017/07/27 08:32:00 UTC

[jira] [Created] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file

liyunzhang_intel created HIVE-17182:
---------------------------------------

             Summary: Invalid statistics like "RAW DATA SIZE" info for parquet file
                 Key: HIVE-17182
                 URL: https://issues.apache.org/jira/browse/HIVE-17182
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


Tested on the TPC-DS store_sales table at the 200 GB scale.
Use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name            	data_type           	comment             
	 	 
ss_sold_time_sk     	bigint              	                    
ss_item_sk          	bigint              	                    
ss_customer_sk      	bigint              	                    
ss_cdemo_sk         	bigint              	                    
ss_hdemo_sk         	bigint              	                    
ss_addr_sk          	bigint              	                    
ss_store_sk         	bigint              	                    
ss_promo_sk         	bigint              	                    
ss_ticket_number    	bigint              	                    
ss_quantity         	int                 	                    
ss_wholesale_cost   	double              	                    
ss_list_price       	double              	                    
ss_sales_price      	double              	                    
ss_ext_discount_amt 	double              	                    
ss_ext_sales_price  	double              	                    
ss_ext_wholesale_cost	double              	                    
ss_ext_list_price   	double              	                    
ss_ext_tax          	double              	                    
ss_coupon_amt       	double              	                    
ss_net_paid         	double              	                    
ss_net_paid_inc_tax 	double              	                    
ss_net_profit       	double              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
ss_sold_date_sk     	bigint              	                    
	 	 
# Detailed Table Information	 	 
Database:           	tpcds_bin_partitioned_parquet_200	 
Owner:              	root                	 
CreateTime:         	Tue Jun 06 11:51:48 CST 2017	 
LastAccessTime:     	UNKNOWN             	 
Retention:          	0                   	 
Location:           	hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\"}
	numFiles            	2023                
	numPartitions       	1824                
	numRows             	575995635           
	rawDataSize         	12671903970         
	totalSize           	46465926745         
	transient_lastDdlTime	1496721108          
{code}
The rawDataSize is about 12 GB (12,671,903,970 bytes), while the totalSize is about 46 GB (46,465,926,745 bytes).
The original generated data on HDFS:
{noformat}
# hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat}
The Parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}
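
The two sets of numbers use different units: the table parameters are plain byte counts, while "hadoop fs -du -h" prints binary units (as far as I know, 1 G there means 2^30 bytes). A minimal sketch of the conversion, using only the values quoted in this report; the helper name to_gib is made up for illustration:

```python
# Values copied from the "describe formatted store_sales" output in this report.
TOTAL_SIZE_BYTES = 46_465_926_745
RAW_DATA_SIZE_BYTES = 12_671_903_970


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (2**30 bytes)."""
    return num_bytes / 2**30


total_gib = to_gib(TOTAL_SIZE_BYTES)
raw_gib = to_gib(RAW_DATA_SIZE_BYTES)

print(f"totalSize   = {total_gib:.1f} GiB")
print(f"rawDataSize = {raw_gib:.1f} GiB")
```

Under that assumption, totalSize lines up with the 43.3 G that du reports for the Parquet directory, and rawDataSize is about 11.8 in the same units.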

It seems that the rawDataSize should be close to 75 GB, but "describe formatted store_sales" reports only about 12 GB.
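
A quick sanity check on the reported value, using only the numbers from the table parameters: dividing rawDataSize by numRows gives the raw size attributed to each row. The column count of 22 comes from the non-partition columns in the schema listing (9 bigint, 1 int, 12 double). Reading the result as "one unit per column value instead of a byte count" is only a guess about how the statistic is produced, not something confirmed in this report.

```python
# Values copied from the table parameters in this report.
NUM_ROWS = 575_995_635
RAW_DATA_SIZE = 12_671_903_970

# Counted from the schema listing: 9 bigint + 1 int + 12 double columns,
# excluding the partition column ss_sold_date_sk.
NUM_DATA_COLUMNS = 9 + 1 + 12

# Raw size attributed to each row by the reported statistic.
per_row = RAW_DATA_SIZE / NUM_ROWS
print(f"rawDataSize / numRows = {per_row}")

# Check whether the statistic is exactly numRows * column count.
matches_column_count = RAW_DATA_SIZE == NUM_ROWS * NUM_DATA_COLUMNS
print(f"equals numRows * {NUM_DATA_COLUMNS}: {matches_column_count}")
```

The quotient is exactly 22, which is far too small to be a byte count for a row of 22 numeric values and happens to equal the number of data columns.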



