You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2018/03/09 05:06:00 UTC

[jira] [Created] (IMPALA-6632) Document compatibility of table and column stats between Impala and Hive

Alexander Behm created IMPALA-6632:
--------------------------------------

             Summary: Document compatibility of table and column stats between Impala and Hive
                 Key: IMPALA-6632
                 URL: https://issues.apache.org/jira/browse/IMPALA-6632
             Project: IMPALA
          Issue Type: Improvement
          Components: Docs
            Reporter: Alexander Behm
            Assignee: Alex Rodoni


The question of compatibility between the table and column stats between Hive and Impala comes up quite often, so is worth documenting explicitly.

Quoting myself from a recent discussion thread to get the docs effort started:

Commonalities:
- Hive and Impala both store row counts at the table level and partition level. Hive also computes and stores additional stats like file counts which Impala does not need or use.

Differences:
- Impala computes and stores column-level stats like the number of distinct values (NDV) only at the table level, and not at the partition level.
- Hive computes and stores column-level stats at the partition level. Impala does not follow this approach because the per-partition NDVs cannot be sensibly combined for queries that access multiple partitions. In short, the column stats for partitioned tables are not compatible between Impala and Hive (because imo Hive's approach does not make sense).
- Impala uses a more modern and tuned algorithm (HyperLogLog++) for estimating the number of distinct values, so they tend to be more accurate than Hive's. Your mileage may vary.
- For unpartitioned tables, the Hive and Impala column stats are compatible.

For partitioned tables, the table-level column stats that Impala writes in the Metastore are stored just like for unpartitioned tables. These statistics are "available" to Hive in the sense that the standard retrieval APIs will work as expected.  My understanding is that for partitioned tables, Hive does not use the table-level column stats, but instead expects partition-level column stats. As I've said before, these partition-level column stats do not make any sense because it is not possible to sensibly combine them for multiple partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)