You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/22 00:14:00 UTC
[jira] [Updated] (IMPALA-8094) Reconcile Impala's column NDV metadata definition with Hive's

     [ https://issues.apache.org/jira/browse/IMPALA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated IMPALA-8094:
--------------------------------
    Description: 
Impala computes the NDV statistic as the number of unique non-null values. Recently Impala has re-enabled a separate null count. So, a column with values ("A", NULL, "B") will report NDV=2, null count = 1. A column with (NULL, NULL) will report NDV=0, null count = 2.

While it is a bit unclear, it appears that Hive *does* include nulls with computing NDV. So, in the above examples, the Hive-generated NDV values would be 3 and 1 respectively.

[This discussion|https://github.com/prestodb/presto/issues/10674] suggests that Hive does count NDVs.

This, then, creates an ambiguity in Impala. Should we adjust NDV to include nulls (if NDV was generated by Impala) or not (if NDV was generated from Hive.)

IMPALA-7310 proposed a fix that supposed that the NDV never includes the null count. But, since the NDV will be different depending on whether Hive or Impala did the calculation, the simple fix in IMPALA-7310 can't be the entire fix. (Though, since the above post points out that Hive NDV is an estimate, it may still be a good idea to adjust small values.)

This ticket asks that:

* Verify that Hive does, indeed, include nulls in NDV count.
* If so, modify the Impala NDV count to add one to NDV if null count > 0 when writing to HMS.

Hive semantics are hazy. The [Hive stats design doc|https://cwiki-test.apache.org/confluence/display/Hive/Column+Statistics+in+Hive] does not specify how NDV should be computed. The [config docs|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics] point out that NDV can be an estimate. More information is available [here|https://cwiki.apache.org/confluence/display/Hive/StatsDev], but still no definition of NDV.

Be careful not to change the meaning of the NDV function itself since it is defined as the number of distinct non-null rows. As noted in the [NDV function docs|https://impala.apache.org/docs/build3x/html/topics/impala_ndv.html]:

bq. An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values".

And the [COUNT function docs|https://impala.apache.org/docs/build3x/html/topics/impala_count.html]:

bq. An aggregate function that returns the number of rows, or the number of non-NULL rows.

Note that, as an odd twist, for only the Boolean type, Impala does include nulls in NDV. That is, if one has a table with a Boolean column, and that column has nulls, then the NDV of that column is 3. This is *not* done for any other data type.

  was:
Impala computes the NDV statistic as the number of unique non-null values. Recently Impala has re-enabled a separate null count. So, a column with values ("A", NULL, "B") will report NDV=2, null count = 1. A column with (NULL, NULL) will report NDV=0, null count = 2.

While it is a bit unclear, it appears that Hive *does* include nulls with computing NDV. So, in the above examples, the Hive-generated NDV values would be 3 and 1 respectively.

[This discussion|https://github.com/prestodb/presto/issues/10674] suggests that Hive does count NDVs.

This, then, creates an ambiguity in Impala. Should we adjust NDV to include nulls (if NDV was generated by Impala) or not (if NDV was generated from Hive.)

IMPALA-7310 proposed a fix that supposed that the NDV never includes the null count. But, since the NDV will be different depending on whether Hive or Impala did the calculation, the simple fix in IMPALA-7310 can't be the entire fix. (Though, since the above post points out that Hive NDV is an estimate, it may still be a good idea to adjust small values.)

This ticket asks that:

* Verify that Hive does, indeed, include nulls in NDV count.
* If so, modify the Impala NDV count to add one to NDV if null count > 0 when writing to HMS.

Hive semantics are hazy. The [Hive stats design doc|https://cwiki-test.apache.org/confluence/display/Hive/Column+Statistics+in+Hive] does not specify how NDV should be computed. The [config docs|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics] point out that NDV can be an estimate. More information is available [here|https://cwiki.apache.org/confluence/display/Hive/StatsDev], but still no definition of NDV.

Be careful not to change the meaning of the NDV function itself since it is defined as the number of distinct non-null rows. As noted in the [NDV function docs|https://impala.apache.org/docs/build3x/html/topics/impala_ndv.html]:

bq. An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values".

And the [COUNT function docs|https://impala.apache.org/docs/build3x/html/topics/impala_count.html]:

bq. An aggregate function that returns the number of rows, or the number of non-NULL rows.


> Reconcile Impala's column NDV metadata definition with Hive's
> -------------------------------------------------------------
>
>                 Key: IMPALA-8094
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8094
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Impala computes the NDV statistic as the number of unique non-null values. Recently Impala has re-enabled a separate null count. So, a column with values ("A", NULL, "B") will report NDV=2, null count = 1. A column with (NULL, NULL) will report NDV=0, null count = 2.
> While it is a bit unclear, it appears that Hive *does* include nulls with computing NDV. So, in the above examples, the Hive-generated NDV values would be 3 and 1 respectively.
> [This discussion|https://github.com/prestodb/presto/issues/10674] suggests that Hive does count NDVs.
> This, then, creates an ambiguity in Impala. Should we adjust NDV to include nulls (if NDV was generated by Impala) or not (if NDV was generated from Hive.)
> IMPALA-7310 proposed a fix that supposed that the NDV never includes the null count. But, since the NDV will be different depending on whether Hive or Impala did the calculation, the simple fix in IMPALA-7310 can't be the entire fix. (Though, since the above post points out that Hive NDV is an estimate, it may still be a good idea to adjust small values.)
> This ticket asks that:
> * Verify that Hive does, indeed, include nulls in NDV count.
> * If so, modify the Impala NDV count to add one to NDV if null count > 0 when writing to HMS.
> Hive semantics are hazy. The [Hive stats design doc|https://cwiki-test.apache.org/confluence/display/Hive/Column+Statistics+in+Hive] does not specify how NDV should be computed. The [config docs|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics] point out that NDV can be an estimate. More information is available [here|https://cwiki.apache.org/confluence/display/Hive/StatsDev], but still no definition of NDV.
> Be careful not to change the meaning of the NDV function itself since it is defined as the number of distinct non-null rows. As noted in the [NDV function docs|https://impala.apache.org/docs/build3x/html/topics/impala_ndv.html]:
> bq. An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values".
> And the [COUNT function docs|https://impala.apache.org/docs/build3x/html/topics/impala_count.html]:
> bq. An aggregate function that returns the number of rows, or the number of non-NULL rows.
> Note that, as an odd twist, for only the Boolean type, Impala does include nulls in NDV. That is, if one has a table with a Boolean column, and that column has nulls, then the NDV of that column is 3. This is *not* done for any other data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org