Posted to issues@drill.apache.org by "Denys Ordynskiy (Jira)" <ji...@apache.org> on 2020/01/31 21:24:00 UTC
[jira] [Closed] (DRILL-7357) Expose Drill Metastore data through INFORMATION_SCHEMA

     [ https://issues.apache.org/jira/browse/DRILL-7357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denys Ordynskiy closed DRILL-7357.
----------------------------------

Successfully tested INFORMATION_SCHEMA tables.

The new columns in the `*TABLES*` table are filled by the 'REFRESH METADATA' command.

In the `*COLUMNS*` table, all fields get their data from the 'REFRESH METADATA' command, except for the `COLUMN_DEFAULT` and `COLUMN_FORMAT` columns. Drill will use these two columns in future Metastore implementations.
To fill the `NDV` and `EST_NUM_NON_NULLS` columns, the option `planner.statistics.use` must be set to 'true'.

A Parquet table with subdirectories was used to fill the `*PARTITIONS*` table.
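
For anyone repeating the check, the sequence above can be sketched as a list of Drill SQL statements. This is only an illustration, not the exact statements used during testing: the helper name `verification_statements`, the table `dfs.tmp.nation` (borrowed from the examples in the issue description), the selected columns, and the statement order are all assumptions.

```python
def verification_statements(schema="dfs.tmp", table="nation"):
    """Build the SQL statements for one verification pass.

    The schema and table names are assumed; they mirror the example
    values in the issue description, not the actual test setup.
    """
    where = f"WHERE TABLE_SCHEMA = '{schema}' AND TABLE_NAME = '{table}'"
    return [
        # INFORMATION_SCHEMA exposes Metastore data only when this is true.
        "ALTER SESSION SET `metastore.enabled` = true",
        # Needed so that NDV and EST_NUM_NON_NULLS are filled.
        "ALTER SESSION SET `planner.statistics.use` = true",
        # Collects the metadata that the new columns are filled from.
        f"ANALYZE TABLE {schema}.`{table}` REFRESH METADATA",
        # Inspect the three INFORMATION_SCHEMA tables touched by this issue.
        f"SELECT * FROM INFORMATION_SCHEMA.`TABLES` {where}",
        "SELECT COLUMN_NAME, NDV, EST_NUM_NON_NULLS "
        f"FROM INFORMATION_SCHEMA.`COLUMNS` {where}",
        f"SELECT * FROM INFORMATION_SCHEMA.`PARTITIONS` {where}",
    ]


if __name__ == "__main__":
    for statement in verification_statements():
        print(statement + ";")
```

The statements can be pasted into any Drill client; the helper only assembles strings and does not connect to Drill.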

> Expose Drill Metastore data through INFORMATION_SCHEMA
> ------------------------------------------------------
>
>                 Key: DRILL-7357
>                 URL: https://issues.apache.org/jira/browse/DRILL-7357
>             Project: Apache Drill
>          Issue Type: Sub-task
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> Document:
> https://docs.google.com/document/d/10CkLdrlUJUNRrHKLeo8jTUJB8xAP1D0byTOvn8wNoF0/edit#heading=h.gzj2dj5a4yds
> Sections: 
> 5.19 INFORMATION_SCHEMA updates
> 4.3.2 Using the statistics
> information_schema tables will contain data from Metastore only if {{metastore.enabled}} is set to true.
> This Jira adds additional columns to the TABLES and COLUMNS tables and a new PARTITIONS table.
> Note: the new columns and table apply only to Metastore data; for data from other sources these columns will be set to null.
> Additional columns
> *TABLES:*
> TABLE_SOURCE - table data type: PARQUET, CSV, JSON
> LOCATION - table location: /tmp/nation
> NUM_ROWS - number of rows in a table if known, null if not known
> LAST_MODIFIED_TIME - table's last modification time
> *COLUMNS:*
> COLUMN_SIZE (already existed but was not included, applicable for all sources) - estimated column size, for example for boolean 1, for integer 11 (sign + 10 digits), etc.
> COLUMN_DEFAULT (already existed but was never filled in) - column default value
> COLUMN_FORMAT - usually applicable for date time columns: yyyy-MM-dd
> NUM_NULLS - number of nulls in column values
> MIN_VAL - column min value in String representation: aaa
> MAX_VAL - column max value in String representation: zzz
> NDV - number of distinct values in column, expressed in Double
> EST_NUM_NON_NULLS - estimated number of non null values, expressed in Double
> IS_NESTED - whether the column is nested. Nested columns are extracted from columns with struct type.
> *PARTITIONS* table columns:
> TABLE_CATALOG - table catalog (currently we have only one catalog): DRILL
> TABLE_SCHEMA - table schema: dfs.tmp
> TABLE_NAME - table name: nation
> METADATA_KEY - top level segment key, the same for all nested segments and partitions: part_int=3
> METADATA_TYPE - SEGMENT or PARTITION
> METADATA_IDENTIFIER - current metadata identifier: part_int=3/part_varchar=g
> PARTITION_COLUMN - partition column name: part_varchar
> PARTITION_VALUE - partition column  value: g
> LOCATION - segment location, null for partitions: /tmp/nation/part_int=3
> LAST_MODIFIED_TIME - last modification time
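
The example values listed for the PARTITIONS columns fit together in a simple way, which the following sketch makes explicit. It only reproduces the relationship visible in the examples (`part_int=3/part_varchar=g`) and assumes `name=value` segments separated by `/`; the function name `describe_identifier` is hypothetical, and this is not a description of how Drill derives these columns internally.

```python
def describe_identifier(metadata_identifier):
    """Split an identifier such as 'part_int=3/part_varchar=g' into
    PARTITIONS-style fields, assuming 'name=value' segments joined by '/'."""
    segments = metadata_identifier.split("/")
    # The last segment names the partition column and its value.
    column, _, value = segments[-1].partition("=")
    return {
        # The top-level segment is shared by all nested segments and partitions.
        "METADATA_KEY": segments[0],
        "METADATA_IDENTIFIER": metadata_identifier,
        "PARTITION_COLUMN": column,
        "PARTITION_VALUE": value,
    }


if __name__ == "__main__":
    print(describe_identifier("part_int=3/part_varchar=g"))
```

For a top-level segment such as `part_int=3`, the same function returns a METADATA_KEY equal to the identifier itself.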
