You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Mankalale, Bharath" <bm...@informatica.com> on 2016/08/18 19:18:14 UTC
Getting column statistics on paritioned Hive tables
Hi,
I was trying to get column statistics for a partitioned hive table. Generally for a non-partitioned table I can run
`ANALYZE TABLE TABLENAME COMPUTE STATISTICS FOR COLUMNS` and I can access the column statistics by
`DESCRIBE FORMATTED TABLENAME COLUMNNAME` from the hive client. This does not work for partitioned tables.
I see from the metastore the column statistics are stored in PART_COL_STATS and not in TAB_COL_STATS for partitioned tables. Is there any way to access the column statistics for the whole table?
I am using Hive 1.1.0
Thanks,
Bharath
RE: Getting column statistics on paritioned Hive tables
Posted by "Mankalale, Bharath" <bm...@informatica.com>.
Thanks. I think using metastore api is what I wanted.
Thanks,
Bharath
From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: Thursday, August 18, 2016 2:04 PM
To: user
Subject: Re: Getting column statistics on paritioned Hive tables
In general in Hive 2 you can get statistics for partitions by running:
hive> analyze table sales partition (year, month) compute statistics;
Partition oraclehadoop.sales{year=2000, month=10} stats: [numFiles=256, numRows=21034, totalSize=1651890, rawDataSize=6226064]
Partition oraclehadoop.sales{year=1999, month=4} stats: [numFiles=256, numRows=16512, totalSize=1533145, rawDataSize=4887552]
Partition oraclehadoop.sales{year=1999, month=8} stats: [numFiles=256, numRows=22979, totalSize=1697346, rawDataSize=6801784]
Partition oraclehadoop.sales{year=2001, month=8} stats: [numFiles=256, numRows=23879, totalSize=1744781, rawDataSize=7068184]
Partition oraclehadoop.sales{year=1998, month=2} stats: [numFiles=256, numRows=14149, totalSize=1438496, rawDataSize=4188104]
Partition oraclehadoop.sales{year=1999, month=7} stats: [numFiles=256, numRows=21648, totalSize=1657439, rawDataSize=6407808]
Partition oraclehadoop.sales{year=1999, month=5} stats: [numFiles=256, numRows=19733, totalSize=1623643, rawDataSize=5840968]
Partition oraclehadoop.sales{year=1999, month=1} stats: [numFiles=256, numRows=20637, totalSize=1638403, rawDataSize=6108552]
The partition statistics are shown for each partition as above.
Here not only there are partitions but also each partition is bucketed into 256 buckets.
Individual column stats can be obtained from the metadata table part_col_stats
[Inline images 1]
If you are using ORC file, the statistics can be obtained from
hive --orcfiledump --rowindex <FILE_PATH_ON_HDFS>
HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
On 18 August 2016 at 21:32, Gopal Vijayaraghavan <go...@apache.org>> wrote:
> Is there any way to access the column statistics for the whole table?
There's no column statistics for the whole table - the only way to get one
is to merge all the partition column statistics.
The metastore API actually exposes this (if you're looking for schema info
to read in a program).
https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/metastor
e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html
+
https://github.com/apache/hive/blob/master/itests/hive-unit/src/test/java/o
rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegration.ja
va#L184
Cheers,
Gopal
Re: Getting column statistics on paritioned Hive tables
Posted by Mich Talebzadeh <mi...@gmail.com>.
In general in Hive 2 you can get statistics for partitions by running:
hive> analyze table sales partition (year, month) compute statistics;
Partition oraclehadoop.sales{year=2000, month=10} stats: [numFiles=256,
numRows=21034, totalSize=1651890, rawDataSize=6226064]
Partition oraclehadoop.sales{year=1999, month=4} stats: [numFiles=256,
numRows=16512, totalSize=1533145, rawDataSize=4887552]
Partition oraclehadoop.sales{year=1999, month=8} stats: [numFiles=256,
numRows=22979, totalSize=1697346, rawDataSize=6801784]
Partition oraclehadoop.sales{year=2001, month=8} stats: [numFiles=256,
numRows=23879, totalSize=1744781, rawDataSize=7068184]
Partition oraclehadoop.sales{year=1998, month=2} stats: [numFiles=256,
numRows=14149, totalSize=1438496, rawDataSize=4188104]
Partition oraclehadoop.sales{year=1999, month=7} stats: [numFiles=256,
numRows=21648, totalSize=1657439, rawDataSize=6407808]
Partition oraclehadoop.sales{year=1999, month=5} stats: [numFiles=256,
numRows=19733, totalSize=1623643, rawDataSize=5840968]
Partition oraclehadoop.sales{year=1999, month=1} stats: [numFiles=256,
numRows=20637, totalSize=1638403, rawDataSize=6108552]
The partition statistics are shown for each partition as above.
Here not only there are partitions but also each partition is bucketed into
256 buckets.
Individual column stats can be obtained from the metadata table
part_col_stats
[image: Inline images 1]
If you are using ORC file, the statistics can be obtained from
hive --orcfiledump --rowindex <FILE_PATH_ON_HDFS>
HTH
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On 18 August 2016 at 21:32, Gopal Vijayaraghavan <go...@apache.org> wrote:
> > Is there any way to access the column statistics for the whole table?
>
> There's no column statistics for the whole table - the only way to get one
> is to merge all the partition column statistics.
>
> The metastore API actually exposes this (if you're looking for schema info
> to read in a program).
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/
> apache/hadoop/hive/metastor
> e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html
>
> +
> https://github.com/apache/hive/blob/master/itests/hive-
> unit/src/test/java/o
> rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegra
> tion.ja
> va#L184
>
>
>
> Cheers,
> Gopal
>
>
>
>
Re: Getting column statistics on paritioned Hive tables
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Is there any way to access the column statistics for the whole table?
There's no column statistics for the whole table - the only way to get one
is to merge all the partition column statistics.
The metastore API actually exposes this (if you're looking for schema info
to read in a program).
https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/metastor
e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html
+
https://github.com/apache/hive/blob/master/itests/hive-unit/src/test/java/o
rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegration.ja
va#L184
Cheers,
Gopal