Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/05/24 20:00:07 UTC

[jira] [Created] (IMPALA-5360) Prefer exact partition row counts in scan cardinality estimation with row count extrapolation enabled.

Alexander Behm created IMPALA-5360:
--------------------------------------

             Summary: Prefer exact partition row counts in scan cardinality estimation with row count extrapolation enabled.
                 Key: IMPALA-5360
                 URL: https://issues.apache.org/jira/browse/IMPALA-5360
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.9.0
            Reporter: Alexander Behm


With IMPALA-2373 we added row count extrapolation for scan-cardinality estimation. The simple approach taken in IMPALA-2373 has the downside that per-partition row counts are ignored in favor of always extrapolating based on the number of bytes scanned.

That choice has the following undesirable side effect: immediately after running COMPUTE STATS, the estimated row count of a subset of partitions may be slightly inaccurate, even if the precise row count could have been determined from the per-partition stats. The problem is that we do not know if the partitions have changed since the last COMPUTE STATS, and so we always prefer to extrapolate to accommodate potential changes.
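
To illustrate the inaccuracy, here is a minimal sketch in Python. It assumes the extrapolation scales the bytes scanned by a table-level rows-per-byte ratio; that formula, the partition names, and all numbers are assumptions for illustration and are not taken from the Impala code.

```python
# Hypothetical per-partition stats captured at COMPUTE STATS time:
# partition name -> (row_count, total_file_bytes). Numbers are made up.
partitions = {
    "p=1": (1000, 100_000),  # 100 bytes per row
    "p=2": (1000, 110_000),  # 110 bytes per row
}


def extrapolate_rows(selected, stats):
    """Estimate rows as bytes_scanned * (table_rows / table_bytes).

    Assumed form of the extrapolation; per-partition row counts are ignored.
    """
    table_rows = sum(rows for rows, _ in stats.values())
    table_bytes = sum(size for _, size in stats.values())
    bytes_scanned = sum(stats[p][1] for p in selected)
    return round(bytes_scanned * table_rows / table_bytes)


def exact_rows(selected, stats):
    """Sum the per-partition row counts recorded by COMPUTE STATS."""
    return sum(stats[p][0] for p in selected)


# Scanning only p=1 right after COMPUTE STATS: the two numbers differ
# because p=1 has a different bytes-per-row ratio than the table overall.
print(extrapolate_rows(["p=1"], partitions), exact_rows(["p=1"], partitions))
```

Scanning all partitions gives the same answer either way; the discrepancy only shows up for subsets whose bytes-per-row ratio differs from the table-wide average.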

A possible improvement is to implement per-partition change detection, e.g., by computing and storing a hash of file names and sizes at the time of COMPUTE STATS. Later, during cardinality estimation we can use the hash to determine whether partitions have changed since the last stats collection and decide whether to use the exact per-partition row count or extrapolate.
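
The change detection idea could be sketched roughly as follows, in Python. The function names, the choice of SHA-256, and the rows_per_byte parameter are hypothetical and only serve the illustration; how and where the hash would actually be stored is not decided in this issue.

```python
import hashlib


def partition_fingerprint(files):
    """Hash the (file name, file size) pairs of one partition.

    The pairs are sorted first so that listing order does not matter.
    """
    h = hashlib.sha256()
    for name, size in sorted(files):
        h.update(f"{name}\0{size}\n".encode("utf-8"))
    return h.hexdigest()


def estimate_partition_rows(files, stored_rows, stored_fingerprint, rows_per_byte):
    """Use the exact row count if the partition is unchanged, else extrapolate.

    stored_rows and stored_fingerprint stand for values that would have been
    recorded at COMPUTE STATS time (hypothetical storage).
    """
    if stored_fingerprint is not None and partition_fingerprint(files) == stored_fingerprint:
        return stored_rows
    total_bytes = sum(size for _, size in files)
    return round(total_bytes * rows_per_byte)


# At COMPUTE STATS time (made-up file names and sizes).
files = [("a.parq", 60_000), ("b.parq", 40_000)]
fingerprint = partition_fingerprint(files)

# Later: an unchanged partition yields the exact count, while a partition
# with an added file falls back to extrapolation.
unchanged = estimate_partition_rows(files, 1000, fingerprint, 0.01)
changed = estimate_partition_rows(files + [("c.parq", 50_000)], 1000, fingerprint, 0.01)
print(unchanged, changed)
```

Hashing only names and sizes keeps the check cheap, since both are available from file metadata, but it would miss a file rewritten in place with an identical size.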

It is unclear how exactly this change detection should interact with COMPUTE STATS TABLESAMPLE (IMPALA-5310).


