Posted to issues@hive.apache.org by "Alessandro Solimando (Jira)" <ji...@apache.org> on 2022/06/10 13:52:00 UTC
[jira] [Updated] (HIVE-26313) Aggregate all column statistics into a single field
[ https://issues.apache.org/jira/browse/HIVE-26313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Solimando updated HIVE-26313:
----------------------------------------
Labels: breaking_change (was: )
> Aggregate all column statistics into a single field
> ---------------------------------------------------
>
> Key: HIVE-26313
> URL: https://issues.apache.org/jira/browse/HIVE-26313
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore, Statistics
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Priority: Major
> Labels: breaking_change
>
> At the moment, column statistics tables in the metastore schema look like this (it's similar for _PART_COL_STATS_):
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
> "CAT_NAME" VARCHAR(256) NOT NULL,
> "DB_NAME" VARCHAR(128) NOT NULL,
> "TABLE_NAME" VARCHAR(256) NOT NULL,
> "COLUMN_NAME" VARCHAR(767) NOT NULL,
> "COLUMN_TYPE" VARCHAR(128) NOT NULL,
> "LONG_LOW_VALUE" BIGINT,
> "LONG_HIGH_VALUE" BIGINT,
> "DOUBLE_LOW_VALUE" DOUBLE,
> "DOUBLE_HIGH_VALUE" DOUBLE,
> "BIG_DECIMAL_LOW_VALUE" VARCHAR(4000),
> "BIG_DECIMAL_HIGH_VALUE" VARCHAR(4000),
> "NUM_DISTINCTS" BIGINT,
> "NUM_NULLS" BIGINT NOT NULL,
> "AVG_COL_LEN" DOUBLE,
> "MAX_COL_LEN" BIGINT,
> "NUM_TRUES" BIGINT,
> "NUM_FALSES" BIGINT,
> "LAST_ANALYZED" BIGINT,
> "CS_ID" BIGINT NOT NULL,
> "TBL_ID" BIGINT NOT NULL,
> "BIT_VECTOR" BLOB,
> "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The idea is to replace the individual statistics columns with a single blob named _STATISTICS_, as follows:
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
> "CAT_NAME" VARCHAR(256) NOT NULL,
> "DB_NAME" VARCHAR(128) NOT NULL,
> "TABLE_NAME" VARCHAR(256) NOT NULL,
> "COLUMN_NAME" VARCHAR(767) NOT NULL,
> "COLUMN_TYPE" VARCHAR(128) NOT NULL,
> "STATISTICS" BLOB,
> "LAST_ANALYZED" BIGINT,
> "CS_ID" BIGINT NOT NULL,
> "TBL_ID" BIGINT NOT NULL,
> "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The _STATISTICS_ column could hold a serialized JSON-encoded string, which would be consumed in a "schema-on-read" fashion.
> At first, at least the removed column statistics will be encoded in the _STATISTICS_ column, but since each "consumer" reads only the portion of the schema it is interested in, multiple engines (see the _ENGINE_ column) can read and write statistics as they see fit.
> Another advantage is that, if we add more statistics in the future, we won't need to change the Thrift interface for the metastore again.
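As a rough illustration of the "schema-on-read" idea described above, the following Python sketch packs the removed columns into one JSON document and lets a consumer pull out only the keys it understands. The issue does not specify the JSON layout, so the lowercase key names, the base64 encoding of _BIT_VECTOR_, and the helper names `encode_statistics` and `read_statistics` are all assumptions made for this example, not the proposed implementation.

```python
import base64
import json

# Columns dropped from TAB_COL_STATS in the proposed schema.
REMOVED_COLUMNS = [
    "LONG_LOW_VALUE", "LONG_HIGH_VALUE",
    "DOUBLE_LOW_VALUE", "DOUBLE_HIGH_VALUE",
    "BIG_DECIMAL_LOW_VALUE", "BIG_DECIMAL_HIGH_VALUE",
    "NUM_DISTINCTS", "NUM_NULLS",
    "AVG_COL_LEN", "MAX_COL_LEN",
    "NUM_TRUES", "NUM_FALSES",
    "BIT_VECTOR",
]


def encode_statistics(row):
    """Pack the removed columns of one row into a JSON blob (hypothetical layout).

    NULL (None) values are omitted; binary values such as BIT_VECTOR are
    base64-encoded because JSON cannot carry raw bytes.
    """
    payload = {}
    for column in REMOVED_COLUMNS:
        value = row.get(column)
        if value is None:
            continue
        if isinstance(value, (bytes, bytearray)):
            value = base64.b64encode(bytes(value)).decode("ascii")
        payload[column.lower()] = value
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def read_statistics(blob, wanted_keys):
    """Schema-on-read: return only the keys this consumer asks for.

    Keys the consumer does not know about are ignored, and requested keys
    that are absent from the blob are simply left out of the result.
    """
    document = json.loads(blob.decode("utf-8")) if blob else {}
    return {key: document[key] for key in wanted_keys if key in document}
```

For example, an engine that only cares about null and distinct counts could call `read_statistics(blob, ["num_nulls", "num_distincts"])` and would be unaffected if another engine later wrote additional keys into the same blob.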