Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/04 02:13:00 UTC

[jira] [Commented] (IMPALA-8024) HBase table cardinality estimates are wrong

    [ https://issues.apache.org/jira/browse/IMPALA-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733736#comment-16733736 ] 

Paul Rogers commented on IMPALA-8024:
-------------------------------------

Another example, {{functional_hbase.stringids}}:

{noformat}
Query: show table stats stringids
+-----------------+--------------+------------+--------+
| Region Location | Start RowKey | Est. #Rows | Size   |
+-----------------+--------------+------------+--------+
| localhost       |              | 10         | 0B     |
| localhost       | 1            | 4295       | 1.00MB |
| localhost       | 3            | 4267       | 1.00MB |
| localhost       | 5            | 4292       | 1.00MB |
| localhost       | 7            | 4290       | 1.00MB |
| localhost       | 9            | 10         | 0B     |
| Total           |              | 17164      | 4.00MB |
+-----------------+--------------+------------+--------+

Query: show column stats stringids
+-----------------+-----------+------------------+--------+----------+-------------------+
| Column          | Type      | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------------+-----------+------------------+--------+----------+-------------------+
| id              | STRING    | 10000            | 0      | 4        | 3.888999938964844 |
...

select count(*) from stringids
+----------+
| count(*) |
+----------+
| 10000    |
+----------+
{noformat}

Here, {{id}} is unique, so its NDV (10,000) reflects the row count at the time stats were gathered. The table stats, however, estimate 17,164 rows (about 17K), while the actual row count is 10,000, the same as the NDV in the column stats.
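
The mismatch can be checked with a few lines of arithmetic. The following is only an illustrative sketch, not Impala code: the variable and function names are made up for this example, and the numbers are copied from the output above. It assumes that the NDV of a column can be treated as an approximate lower bound on the row count (NDV is itself an estimate, so the bound is not exact).

```python
# Illustrative sketch only; names are hypothetical, not part of Impala.
# Values are copied from the SHOW TABLE STATS / SHOW COLUMN STATS output above.
region_row_estimates = [10, 4295, 4267, 4292, 4290, 10]
column_ndvs = {"id": 10000}  # only the id column appears in the output above
actual_row_count = 10000     # from select count(*)


def ndv_lower_bound(ndvs):
    """Return the largest NDV: a table has at least as many rows as
    its most distinct column has distinct values (approximately,
    because NDV is itself an estimate)."""
    return max(ndvs.values())


def error_ratio(estimate, actual):
    """Return how far an estimate is from the actual count, as a ratio."""
    return estimate / actual


table_estimate = sum(region_row_estimates)
lower_bound = ndv_lower_bound(column_ndvs)
ratio = error_ratio(table_estimate, actual_row_count)

print("table stats estimate:", table_estimate)
print("NDV lower bound:     ", lower_bound)
print("estimate / actual:   ", ratio)
```

For {{stringids}}, the per-region estimates sum to 17164, which is roughly 1.7 times the actual count, while the NDV of {{id}} alone matches the actual count.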

> HBase table cardinality estimates are wrong
> -------------------------------------------
>
>                 Key: IMPALA-8024
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8024
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> IMPALA-8021 added cardinality estimates to EXPLAIN plan output. Running some of our {{PlannerTest}} files revealed that our HBase cardinality estimates are very poor, even for our simple test tables. For example, for {{functional_hbase.alltypessmall}}:
> {{count(*)}} tells us that there are 100 rows:
> {noformat}
> select count(*) from functional_hbase.alltypessmall
> +----------+
> | count(*) |
> +----------+
> | 100      |
> +----------+
> {noformat}
> Table stats claim that there are only 60 rows:
> {noformat}
> show table stats functional_hbase.alltypessmall;
> +-----------------+--------------+------------+------+
> | Region Location | Start RowKey | Est. #Rows | Size |
> +-----------------+--------------+------------+------+
> | localhost       |              | 10         | 0B   |
> | localhost       | 1            | 10         | 0B   |
> | localhost       | 3            | 10         | 0B   |
> | localhost       | 5            | 10         | 0B   |
> | localhost       | 7            | 10         | 0B   |
> | localhost       | 9            | 10         | 0B   |
> | Total           |              | 60         | 0B   |
> +-----------------+--------------+------------+------+
> {noformat}
> The NDV stats show that there must be at least 100 rows:
> {noformat}
> show column stats functional_hbase.alltypessmall
> +-----------------+-----------+------------------+--------+----------+----------+
> | Column          | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
> +-----------------+-----------+------------------+--------+----------+----------+
> | id              | INT       | 99               | 0      | 4        | 4        |
> ...
> | timestamp_col   | TIMESTAMP | 100              | 0      | 16       | 16       |
> ...
> +-----------------+-----------+------------------+--------+----------+----------+
> {noformat}
> The query planner, where the estimate matters most, assumes there are only 50 rows:
> {noformat}
> select *
> from functional.alltypesagg join functional_hbase.alltypessmall using (id, int_col)
> |--01:SCAN HBASE [functional_hbase.alltypessmall]
> |     row-size=89B cardinality=50
> {noformat}
> We need a more reliable estimate.



