Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2018/02/07 23:55:00 UTC

[jira] [Created] (IMPALA-6491) More robust HBase scan cardinality estimation

Alexander Behm created IMPALA-6491:
--------------------------------------

             Summary: More robust HBase scan cardinality estimation
                 Key: IMPALA-6491
                 URL: https://issues.apache.org/jira/browse/IMPALA-6491
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
            Reporter: Alexander Behm


There are a few issues with our HBase scan cardinality estimation:
1. The cardinality estimates can be very inaccurate, leading to bad plan choices. In particular, users have reported cases of severe underestimation, which can have a ripple effect on the rest of the query plan (e.g., the planner concludes that a join with that table is selective).
2. Unlike for HDFS scans, we do not use the row count statistics from the Hive Metastore to estimate the cardinality of HBase scans. Instead, we do a small scan over the HBase table and estimate a row count from the sampled average bytes per row and the total storefile size (see the sketch below).
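
A minimal sketch of what this estimation amounts to, assuming a sampled average row size and an aggregate storefile size are available. Class and method names are illustrative, not the actual planner code:

{code:java}
// Illustrative sketch of the current approach: sample a few rows to obtain an
// average serialized row size, then divide the on-disk storefile size by it.
// Names are hypothetical; this is not the actual Impala planner code.
public class HBaseRowCountSketch {
  public static long estimateRowCount(double sampledAvgRowSizeBytes,
      long totalStorefileSizeBytes) {
    if (sampledAvgRowSizeBytes <= 0 || totalStorefileSizeBytes <= 0) {
      return -1; // -1 signals an unknown estimate in this sketch
    }
    return Math.round(totalStorefileSizeBytes / sampledAvgRowSizeBytes);
  }
}
{code}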

There are other, more detailed caveats with this HBase estimation method.

The original motivation of this method was to adjust the row count for queries that only scan a subset of the region servers (the HMS statistics only cover the entire table).

*Proposal*
To address these shortcomings, we could start with the table-level row count stored in the Metastore and then adjust that number based on the total number of bytes in the table and the number of bytes in the relevant region servers, as sketched below.
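
A minimal sketch of the proposed estimate, assuming the HMS row count and per-region byte counts are available. Names are hypothetical; this is a sketch of the idea, not a concrete implementation:

{code:java}
// Hypothetical sketch of the proposal: start from the HMS table-level row count
// and scale it by the fraction of the table's bytes held by the scanned regions.
public class HBaseScanCardinalitySketch {
  public static long estimateScanCardinality(long hmsRowCount,
      long totalTableBytes, long scannedRegionBytes) {
    if (hmsRowCount < 0) return -1;               // no HMS stats available
    if (totalTableBytes <= 0) return hmsRowCount; // cannot scale; use table-level count
    double scannedFraction =
        Math.min(1.0, (double) scannedRegionBytes / totalTableBytes);
    return Math.round(hmsRowCount * scannedFraction);
  }
}
{code}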



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)