Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2020/06/23 15:55:00 UTC

[jira] [Created] (IMPALA-9883) Fix stats extrapolation works for full ACID tables

Zoltán Borók-Nagy created IMPALA-9883:
-----------------------------------------

             Summary: Fix stats extrapolation works for full ACID tables
                 Key: IMPALA-9883
                 URL: https://issues.apache.org/jira/browse/IMPALA-9883
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Zoltán Borók-Nagy


Full ACID tables have _delta_ and _delete delta_ files. Delta files contain the inserted table data, while delete delta files contain tombstones that denote the deleted rows.

Therefore the result of a SELECT contains the rows coming from the delta files minus the rows whose tombstone is present in the delete delta files. See Full ACID Milestone 4 for more details.

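Stats extrapolation uses file sampling. For example, if the user issues COMPUTE STATS table TABLESAMPLE SYSTEM(10); then Impala randomly selects files whose aggregated byte size is at least 10% of the total byte size of the table files. Unfortunately, for ACID tables this method doesn't estimate the stats correctly: a random sample can miss delete delta files whose tombstones apply to the sampled rows, and the total byte size used for extrapolation includes the delete deltas, which hold tombstones rather than table data.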
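
As an illustration only, the file sampling described above can be sketched roughly as follows. This is a simplified Python model, not Impala's actual implementation; DataFile, sample_files and the seed parameter are hypothetical names introduced for the example.

```python
import random
from typing import List, NamedTuple, Optional


class DataFile(NamedTuple):
    """Hypothetical stand-in for a table file and its size."""
    path: str
    size_bytes: int


def sample_files(files: List[DataFile], percent: float,
                 seed: Optional[int] = None) -> List[DataFile]:
    """Randomly pick files until their combined size reaches
    `percent` percent of the total byte size of all files."""
    rng = random.Random(seed)
    total_bytes = sum(f.size_bytes for f in files)
    target_bytes = total_bytes * percent / 100

    shuffled = list(files)
    rng.shuffle(shuffled)

    sample: List[DataFile] = []
    sampled_bytes = 0
    for f in shuffled:
        # Stop as soon as the sample is large enough.
        if sampled_bytes >= target_bytes:
            break
        sample.append(f)
        sampled_bytes += f.size_bytes
    return sample
```

Note that this selection does not distinguish between file kinds, so for a full ACID table it treats delta and delete delta files alike.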

To calculate the stats more precisely, we need to change the sampling method as follows:
 * select all delete delta files
 * select some percentage (provided by TABLESAMPLE) of the delta files
 * extrapolate stats using the total byte size of all delta files (not all table files since those include the delete deltas)
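
The three steps above could look something like the following Python sketch. It is a simplified model under stated assumptions, not the proposed Impala change itself: AcidFile, sample_acid_files and extrapolate_row_count are hypothetical names, and scaling the row count linearly by the byte-size ratio is an assumption made for the example.

```python
import random
from typing import List, NamedTuple, Optional, Tuple


class AcidFile(NamedTuple):
    """Hypothetical stand-in for a file of a full ACID table."""
    path: str
    size_bytes: int
    is_delete_delta: bool


def sample_acid_files(files: List[AcidFile], percent: float,
                      seed: Optional[int] = None
                      ) -> Tuple[List[AcidFile], int, int]:
    """Return (selected files, sampled delta bytes, total delta bytes)."""
    rng = random.Random(seed)
    # Step 1: always select every delete delta file.
    delete_deltas = [f for f in files if f.is_delete_delta]
    # Step 2: sample only among the delta files.
    deltas = [f for f in files if not f.is_delete_delta]
    total_delta_bytes = sum(f.size_bytes for f in deltas)
    target_bytes = total_delta_bytes * percent / 100

    rng.shuffle(deltas)
    sampled: List[AcidFile] = []
    sampled_bytes = 0
    for f in deltas:
        if sampled_bytes >= target_bytes:
            break
        sampled.append(f)
        sampled_bytes += f.size_bytes
    return delete_deltas + sampled, sampled_bytes, total_delta_bytes


def extrapolate_row_count(sampled_rows: int, sampled_delta_bytes: int,
                          total_delta_bytes: int) -> int:
    """Step 3: scale by the delta byte sizes only (assumed linear)."""
    if sampled_delta_bytes == 0:
        return 0
    return round(sampled_rows * total_delta_bytes / sampled_delta_bytes)
```

Here sampled_rows stands for the number of rows read from the sampled delta files after the tombstones from the delete delta files have been applied.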


