You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Mankalale, Bharath" <bm...@informatica.com> on 2016/08/18 19:18:14 UTC

Getting column statistics on paritioned Hive tables

Hi,

  I was trying to get column statistics for a partitioned hive table. Generally for a non-partitioned table I can run
`ANALYZE TABLE TABLENAME COMPUTE STATISTICS FOR COLUMNS`  and I can access the column statistics by
`DESCRIBE FORMATTED TABLENAME COLUMNNAME` from the hive client. This does not work for partitioned tables.
I see from the metastore the column statistics are stored in PART_COL_STATS and not in TAB_COL_STATS for partitioned tables. Is there any way to access the column statistics for the whole table?

I am using Hive 1.1.0

Thanks,
Bharath


RE: Getting column statistics on paritioned Hive tables

Posted by "Mankalale, Bharath" <bm...@informatica.com>.
Thanks. I think using metastore api is what I wanted.

Thanks,
Bharath

From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: Thursday, August 18, 2016 2:04 PM
To: user
Subject: Re: Getting column statistics on paritioned Hive tables

In general in Hive 2 you can get statistics for partitions by running:

hive> analyze table sales partition (year, month) compute statistics;
Partition oraclehadoop.sales{year=2000, month=10} stats: [numFiles=256, numRows=21034, totalSize=1651890, rawDataSize=6226064]
Partition oraclehadoop.sales{year=1999, month=4} stats: [numFiles=256, numRows=16512, totalSize=1533145, rawDataSize=4887552]
Partition oraclehadoop.sales{year=1999, month=8} stats: [numFiles=256, numRows=22979, totalSize=1697346, rawDataSize=6801784]
Partition oraclehadoop.sales{year=2001, month=8} stats: [numFiles=256, numRows=23879, totalSize=1744781, rawDataSize=7068184]
Partition oraclehadoop.sales{year=1998, month=2} stats: [numFiles=256, numRows=14149, totalSize=1438496, rawDataSize=4188104]
Partition oraclehadoop.sales{year=1999, month=7} stats: [numFiles=256, numRows=21648, totalSize=1657439, rawDataSize=6407808]
Partition oraclehadoop.sales{year=1999, month=5} stats: [numFiles=256, numRows=19733, totalSize=1623643, rawDataSize=5840968]
Partition oraclehadoop.sales{year=1999, month=1} stats: [numFiles=256, numRows=20637, totalSize=1638403, rawDataSize=6108552]

The partition statistics are shown for each partition as above.

Here not only there are partitions but also each partition is bucketed into 256 buckets.

Individual column stats can be obtained from the metadata table  part_col_stats

[Inline images 1]

If you are using ORC file, the statistics can be obtained from

 hive --orcfiledump --rowindex <FILE_PATH_ON_HDFS>


HTH



Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 18 August 2016 at 21:32, Gopal Vijayaraghavan <go...@apache.org>> wrote:
> Is there any way to access the column statistics for the whole table?

There's no column statistics for the whole table - the only way to get one
is to merge all the partition column statistics.

The metastore API actually exposes this (if you're looking for schema info
to read in a program).

https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/metastor
e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html

+
https://github.com/apache/hive/blob/master/itests/hive-unit/src/test/java/o
rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegration.ja
va#L184



Cheers,
Gopal




Re: Getting column statistics on paritioned Hive tables

Posted by Mich Talebzadeh <mi...@gmail.com>.
In general in Hive 2 you can get statistics for partitions by running:

hive> analyze table sales partition (year, month) compute statistics;
Partition oraclehadoop.sales{year=2000, month=10} stats: [numFiles=256,
numRows=21034, totalSize=1651890, rawDataSize=6226064]
Partition oraclehadoop.sales{year=1999, month=4} stats: [numFiles=256,
numRows=16512, totalSize=1533145, rawDataSize=4887552]
Partition oraclehadoop.sales{year=1999, month=8} stats: [numFiles=256,
numRows=22979, totalSize=1697346, rawDataSize=6801784]
Partition oraclehadoop.sales{year=2001, month=8} stats: [numFiles=256,
numRows=23879, totalSize=1744781, rawDataSize=7068184]
Partition oraclehadoop.sales{year=1998, month=2} stats: [numFiles=256,
numRows=14149, totalSize=1438496, rawDataSize=4188104]
Partition oraclehadoop.sales{year=1999, month=7} stats: [numFiles=256,
numRows=21648, totalSize=1657439, rawDataSize=6407808]
Partition oraclehadoop.sales{year=1999, month=5} stats: [numFiles=256,
numRows=19733, totalSize=1623643, rawDataSize=5840968]
Partition oraclehadoop.sales{year=1999, month=1} stats: [numFiles=256,
numRows=20637, totalSize=1638403, rawDataSize=6108552]

The partition statistics are shown for each partition as above.

Here not only there are partitions but also each partition is bucketed into
256 buckets.

Individual column stats can be obtained from the metadata table
part_col_stats

[image: Inline images 1]

If you are using ORC file, the statistics can be obtained from

 hive --orcfiledump --rowindex <FILE_PATH_ON_HDFS>


HTH


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 August 2016 at 21:32, Gopal Vijayaraghavan <go...@apache.org> wrote:

> > Is there any way to access the column statistics for the whole table?
>
> There's no column statistics for the whole table - the only way to get one
> is to merge all the partition column statistics.
>
> The metastore API actually exposes this (if you're looking for schema info
> to read in a program).
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/
> apache/hadoop/hive/metastor
> e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html
>
> +
> https://github.com/apache/hive/blob/master/itests/hive-
> unit/src/test/java/o
> rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegra
> tion.ja
> va#L184
>
>
>
> Cheers,
> Gopal
>
>
>
>

Re: Getting column statistics on paritioned Hive tables

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Is there any way to access the column statistics for the whole table?

There's no column statistics for the whole table - the only way to get one
is to merge all the partition column statistics.

The metastore API actually exposes this (if you're looking for schema info
to read in a program).

https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/metastor
e/api/ThriftHiveMetastore.Processor.get_aggr_stats_for.html

+
https://github.com/apache/hive/blob/master/itests/hive-unit/src/test/java/o
rg/apache/hadoop/hive/metastore/hbase/TestHBaseAggrStatsCacheIntegration.ja
va#L184


 
Cheers,
Gopal