Posted to issues@hive.apache.org by "Alessandro Solimando (Jira)" <ji...@apache.org> on 2022/06/10 13:53:00 UTC

[jira] [Updated] (HIVE-26313) Aggregate all column statistics into a single field in metastore

     [ https://issues.apache.org/jira/browse/HIVE-26313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Solimando updated HIVE-26313:
----------------------------------------
    Summary: Aggregate all column statistics into a single field in metastore  (was: Aggregate all column statistics into a single field)

> Aggregate all column statistics into a single field in metastore
> ----------------------------------------------------------------
>
>                 Key: HIVE-26313
>                 URL: https://issues.apache.org/jira/browse/HIVE-26313
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Priority: Major
>              Labels: breaking_change
>
> At the moment, column statistics tables in the metastore schema look like this (it's similar for _PART_COL_STATS_):
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "LONG_LOW_VALUE" BIGINT,
>     "LONG_HIGH_VALUE" BIGINT,
>     "DOUBLE_LOW_VALUE" DOUBLE,
>     "DOUBLE_HIGH_VALUE" DOUBLE,
>     "BIG_DECIMAL_LOW_VALUE" VARCHAR(4000),
>     "BIG_DECIMAL_HIGH_VALUE" VARCHAR(4000),
>     "NUM_DISTINCTS" BIGINT,
>     "NUM_NULLS" BIGINT NOT NULL,
>     "AVG_COL_LEN" DOUBLE,
>     "MAX_COL_LEN" BIGINT,
>     "NUM_TRUES" BIGINT,
>     "NUM_FALSES" BIGINT,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "BIT_VECTOR" BLOB,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The idea is to replace the individual statistics columns with a single blob column named _STATISTICS_, as follows:
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "STATISTICS" BLOB,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The _STATISTICS_ column could hold a serialized JSON-encoded string, which would be consumed in a "schema-on-read" fashion.
> Initially, at least the statistics from the removed columns will be encoded in the _STATISTICS_ column. Since each "consumer" reads only the portion of the schema it is interested in, multiple engines (see the _ENGINE_ column) can read and write statistics as they see fit.
> Another advantage is that, if we add more statistics in the future, we won't need to change the Thrift interface for the metastore again.
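
To illustrate the "schema-on-read" idea described above, here is a minimal Python sketch, not the proposed implementation. The issue does not specify the JSON layout, so the key names (numNulls, numDistincts, longLowValue, longHighValue, engineSpecificHistogram) and the helper functions encode_statistics and read_statistics are hypothetical; the first four keys simply mirror some of the removed columns.

```python
import json


def encode_statistics(stats: dict) -> bytes:
    """Serialize a statistics dict into the bytes a STATISTICS blob could hold."""
    return json.dumps(stats, sort_keys=True).encode("utf-8")


def read_statistics(blob: bytes, wanted: set) -> dict:
    """Schema-on-read: decode the blob and keep only the keys this consumer asks for."""
    decoded = json.loads(blob.decode("utf-8"))
    return {key: value for key, value in decoded.items() if key in wanted}


# Hypothetical payload: key names are assumptions, not part of the proposal.
blob = encode_statistics({
    "numNulls": 3,
    "numDistincts": 120,
    "longLowValue": 1,
    "longHighValue": 500,
    "engineSpecificHistogram": [10, 40, 70],  # extra statistic written by another engine
})

# One consumer reads only the statistics that used to live in dedicated columns.
hive_view = read_statistics(
    blob, {"numNulls", "numDistincts", "longLowValue", "longHighValue"}
)

# Another consumer reads a different portion of the same blob.
other_view = read_statistics(blob, {"numNulls", "engineSpecificHistogram"})
```

In this sketch, adding a new statistic only adds a key to the JSON payload: consumers that do not ask for it are unaffected, and neither the table definition nor the reader signature changes.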



--
This message was sent by Atlassian Jira
(v8.20.7#820007)