Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2018/02/07 23:55:00 UTC

[jira] [Created] (IMPALA-6491) More robust HBase scan cardinality estimation

Alexander Behm created IMPALA-6491:
--------------------------------------

             Summary: More robust HBase scan cardinality estimation
                 Key: IMPALA-6491
                 URL: https://issues.apache.org/jira/browse/IMPALA-6491
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
            Reporter: Alexander Behm


There are a few issues with our HBase scan cardinality estimation:
1. The cardinality estimates can be very inaccurate, leading to bad plan choices. In particular, users have reported cases of severe underestimation, which can have a ripple effect on the rest of the query plan (e.g., the planner concludes that a join with that table is selective).
2. Unlike for HDFS scans, we do not use the row count statistics from the Hive Metastore to estimate the cardinality of HBase scans. Instead, we do a small scan over the HBase table and estimate a row count from the sampled average bytes per row and the total storefile size (see the sketch below).
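
A minimal sketch of what this estimation amounts to, assuming a sampled average row size and an aggregate storefile size are available. Class and method names are illustrative, not the actual planner code:

{code:java}
// Illustrative sketch of the current approach: sample a few rows to obtain an
// average serialized row size, then divide the on-disk storefile size by it.
// Names are hypothetical; this is not the actual Impala planner code.
public class HBaseRowCountSketch {
  public static long estimateRowCount(double sampledAvgRowSizeBytes,
      long totalStorefileSizeBytes) {
    if (sampledAvgRowSizeBytes <= 0 || totalStorefileSizeBytes <= 0) {
      return -1; // -1 signals an unknown estimate in this sketch
    }
    return Math.round(totalStorefileSizeBytes / sampledAvgRowSizeBytes);
  }
}
{code}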

There are other, more detailed caveats with this HBase estimation method.

The original motivation of this method was to adjust the row count for queries that only scan a subset of the region servers (the HMS statistics only cover the entire table).

*Proposal*
To address these shortcomings, we could start with the table-level row count stored in the Metastore and then adjust that number based on the total number of bytes in the table and the number of bytes in the relevant region servers, as sketched below.
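
A minimal sketch of the proposed estimate, assuming the HMS row count and per-region byte counts are available. Names are hypothetical; this is a sketch of the idea, not a concrete implementation:

{code:java}
// Hypothetical sketch of the proposal: start from the HMS table-level row count
// and scale it by the fraction of the table's bytes held by the scanned regions.
public class HBaseScanCardinalitySketch {
  public static long estimateScanCardinality(long hmsRowCount,
      long totalTableBytes, long scannedRegionBytes) {
    if (hmsRowCount < 0) return -1;               // no HMS stats available
    if (totalTableBytes <= 0) return hmsRowCount; // cannot scale; use table-level count
    double scannedFraction =
        Math.min(1.0, (double) scannedRegionBytes / totalTableBytes);
    return Math.round(hmsRowCount * scannedFraction);
  }
}
{code}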



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)