Posted to dev@hive.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2017/09/27 14:18:00 UTC

[jira] [Created] (HIVE-17618) Extend ANALYZE TABLE / DESCRIBE FORMATTED functionality with distribution of selected file-level metadata fields

Zoltan Ivanfi created HIVE-17618:
------------------------------------

             Summary: Extend ANALYZE TABLE / DESCRIBE FORMATTED functionality with distribution of selected file-level metadata fields
                 Key: HIVE-17618
                 URL: https://issues.apache.org/jira/browse/HIVE-17618
             Project: Hive
          Issue Type: Improvement
            Reporter: Zoltan Ivanfi


DESCRIBE FORMATTED already shows the number of files:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
    numRows                 15653
[...]
{noformat}

It would be useful to break this number down by different file-level metadata fields. One such field would be the compression codec used by each file. Currently there is no way to check whether the contents of a table are compressed, because some files can be compressed while others are not. A file-count breakdown could provide this missing information in the following form:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
        breakdown by compression:
            Uncompressed:   3
            Snappy:         6
            Deflate:        5
    numRows                 15653
[...]
{noformat}
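
As a rough illustration of the proposed breakdown, the sketch below tallies files by compression codec and renders the result in the layout shown above. It is not Hive code: the function names and the input (one codec name per file, None meaning uncompressed) are assumptions made for the example.

```python
from collections import Counter


def compression_breakdown(file_codecs):
    """Count files per codec. file_codecs holds one entry per file;
    None or an empty string is treated as uncompressed (an assumption)."""
    return Counter(codec if codec else "Uncompressed" for codec in file_codecs)


def render_num_files(breakdown):
    """Render numFiles plus the per-codec counts in the proposed layout."""
    total = sum(breakdown.values())
    lines = [
        "    " + "numFiles".ljust(24) + str(total),
        "        breakdown by compression:",
    ]
    for codec, count in breakdown.items():
        label = codec + ":"
        lines.append("            " + label.ljust(16) + str(count))
    return "\n".join(lines)


# Hypothetical table with 14 files, matching the counts in the example above.
codecs = [None] * 3 + ["Snappy"] * 6 + ["Deflate"] * 5
print(render_num_files(compression_breakdown(codecs)))
```

The counts always sum to numFiles, so the breakdown can be checked against the existing statistic.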

Another useful breakdown would be by the writer field of Parquet files. Impala writes Parquet files slightly differently (string fields are not annotated with UTF8 by default, and timestamps are not adjusted to UTC), so users may want to know what kind of Parquet files a table contains, but they currently have no way to query this. An example output for Parquet tables could look like:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
        breakdown by compression:
            Uncompressed:   3
            Snappy:         6
            Deflate:        5
        breakdown by writer:
            parquet-mr:     9
            impala:         5
    numRows                 15653
[...]
{noformat}
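
A minimal sketch of how the writer breakdown could be derived, assuming the writer is taken from the created_by string in each Parquet file footer, which conventionally has the form "<application> version <version> (build <hash>)". The helper names and the sample strings are illustrative, and reading the footer itself is left out.

```python
from collections import Counter


def writer_name(created_by):
    """Reduce a created_by string to its leading application name.
    Missing or blank values are reported as "unknown" (an assumption)."""
    tokens = (created_by or "").split()
    return tokens[0] if tokens else "unknown"


def writer_breakdown(created_by_values):
    """Count files per writer, given one created_by string per file."""
    return Counter(writer_name(value) for value in created_by_values)


# Illustrative footer strings; real version and build values will differ.
footers = (
    ["parquet-mr version 1.8.1 (build abc123)"] * 9
    + ["impala version 2.9.0 (build def456)"] * 5
)
for writer, count in writer_breakdown(footers).items():
    print(writer + ":", count)
```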

Any other file-level metadata that we consider useful to the user could be incorporated as well. Since gathering file-level metadata is an expensive operation, it should be done when the user issues ANALYZE TABLE ... COMPUTE STATISTICS.
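
The split between an expensive gathering step and a cheap display step can be sketched as follows. This is a toy model, not Hive's metastore API: the TableStats class, its method names, and the scan_files callable are all hypothetical.

```python
from collections import Counter


class TableStats:
    """Toy stand-in for stored table parameters."""

    def __init__(self, scan_files):
        # scan_files: expensive callable returning one codec name per file.
        self._scan_files = scan_files
        self._params = {}

    def analyze(self):
        """Models ANALYZE TABLE: scan the files once and store the result."""
        codecs = list(self._scan_files())
        self._params = {
            "numFiles": len(codecs),
            "compression": dict(Counter(codecs)),
        }

    def describe(self):
        """Models DESCRIBE FORMATTED: read stored values, never rescan."""
        return dict(self._params)


scans = []  # records how many times the expensive scan ran


def fake_scan():
    scans.append(1)
    return ["Snappy", "Snappy", "Deflate"]


stats = TableStats(fake_scan)
stats.analyze()
stats.describe()
stats.describe()
print(len(scans), stats.describe())
```

In this model the breakdown can become stale if files change after the last analyze call, which is the same trade-off the existing statistics already make.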


