Posted to issues-all@impala.apache.org by "Abhishek Rawat (Jira)" <ji...@apache.org> on 2020/11/02 18:28:00 UTC

[jira] [Commented] (IMPALA-7876) COMPUTE STATS TABLESAMPLE is not updating number of estimated rows

    [ https://issues.apache.org/jira/browse/IMPALA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224873#comment-17224873 ] 

Abhishek Rawat commented on IMPALA-7876:
----------------------------------------

The core issue here is that the child query computing num_rows (table stats) uses the ROUND function, which returns its result as a *DECIMAL* type. For example:
SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
When setting the table stats, the CatalogOpExecutor expects the data type to be *BIGINT*.

[https://github.com/apache/impala/blob/master/be/src/exec/catalog-op-executor.cc#L243]

[https://github.com/apache/impala/blob/master/be/src/exec/catalog-op-executor.cc#L255]

This used to work because ROUND previously returned its result as type BIGINT.

This behavior was later changed for the better in the commit "IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns".

There are a couple of ways to fix this issue. I am leaning towards a fix that adds a *CAST AS BIGINT* to the generated SQL for the child query, since num_rows should be a BIGINT.

[https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java#L548]

Also, it is probably best to fix this in the child query's SQL rather than adding implicit casts elsewhere in the code.
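
As a rough illustration of the proposed cast, here is how the example child query from earlier in this comment might look once the cast is added. This is only a sketch: the exact SQL generated by ComputeStatsStmt may differ, and the table name, sampling percentage, divisor and seed are simply copied from the example above.

```sql
-- Current form: ROUND returns DECIMAL, which the stats update does not expect.
SELECT ROUND(COUNT(*) / 0.8935390115)
FROM t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168);

-- Sketch of the proposed form: cast the rounded estimate so num_rows is a BIGINT.
SELECT CAST(ROUND(COUNT(*) / 0.8935390115) AS BIGINT)
FROM t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168);
```

Keeping the cast in the child query means the consumer of the result continues to see the BIGINT it already expects, with no type handling added on that side.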

 

> COMPUTE STATS TABLESAMPLE is not updating number of estimated rows
> ------------------------------------------------------------------
>
>                 Key: IMPALA-7876
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7876
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Andre Araujo
>            Assignee: Abhishek Rawat
>            Priority: Critical
>
> Running the command below seems to have no impact on the #rows stats.
> {code}
> [host:21000] default> COMPUTE STATS wide TABLESAMPLE SYSTEM(5);
> Query: COMPUTE STATS wide TABLESAMPLE SYSTEM(100)
> +-------------------------------------------+
> | summary                                   |
> +-------------------------------------------+
> | Updated 1 partition(s) and 103 column(s). |
> +-------------------------------------------+
> WARNINGS: Ignoring TABLESAMPLE because the effective sampling rate is 100%.
> The minimum sample size is COMPUTE_STATS_MIN_SAMPLE_SIZE=1.00GB and the table size 20.35GB
> Fetched 1 row(s) in 43.67s
> [host:21000] default> show table stats wide;
> Query: show table stats wide
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | #Rows | Extrap #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                            |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | 0     | -1           | 84     | 20.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://ns1/user/hive/warehouse/wide |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> Fetched 1 row(s) in 0.01s
> {code}



