Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/09 00:26:00 UTC
[jira] [Comment Edited] (IMPALA-8058) HBase scan cardinality division-by-zero leads to bogus cardinality

    [ https://issues.apache.org/jira/browse/IMPALA-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737681#comment-16737681 ] 

Paul Rogers edited comment on IMPALA-8058 at 1/9/19 12:25 AM:
--------------------------------------------------------------

Proposed fix:

* Retain the existing code, except...
* If the estimated row width is less than 1, return -1 as the estimate.
* In the HBase scan node, if we get back -1 from the cardinality estimator, use the row count from HMS table stats.

In that fallback case, multiply the total row count from HMS by the filter selectivity to get the scan cardinality.

Note that the existing HBase scan node double-counts filter cardinality:

* It uses the key range estimator described above to estimate the rows in that range.
* It then applies the predicate selectivity a second time in the scan node.

So, a further fix is to:

* Apply all predicates only if we are using the HMS table stats row count.
* Apply only non-key predicate selectivity if we are using the (smaller) key range row count.
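The proposal above can be sketched roughly as follows. This is an illustration only, not the Impala code: the class name, method names and parameters are all hypothetical, and combining key and non-key selectivities by simple multiplication is an assumption (it treats the predicates as independent).

```java
// Hypothetical sketch of the proposed fix; none of these names exist in Impala.
public final class ScanCardinalitySketch {
    /** Sentinel meaning "no estimate available". */
    public static final long UNKNOWN = -1;

    /**
     * Stand-in for the key range estimator: returns UNKNOWN when the
     * estimated row width is less than 1, instead of dividing by it.
     */
    public static long estimateFromKeyRange(long regionBytes, double avgRowWidth) {
        if (avgRowWidth < 1.0) return UNKNOWN;
        return (long) Math.ceil(regionBytes / avgRowWidth);
    }

    /**
     * Scan node side: choose the row count source, then apply only the
     * selectivities that the chosen source has not already accounted for.
     */
    public static long scanCardinality(long keyRangeRows, long hmsRowCount,
                                       double keySelectivity, double nonKeySelectivity) {
        if (keyRangeRows == UNKNOWN) {
            // Fall back to HMS table stats; apply all predicates.
            if (hmsRowCount < 0) return UNKNOWN; // no table stats either
            // Assumption: key and non-key predicates are independent.
            return Math.round(hmsRowCount * keySelectivity * nonKeySelectivity);
        }
        // The key range estimate already reflects the key predicates,
        // so apply only the non-key selectivity.
        return Math.round(keyRangeRows * nonKeySelectivity);
    }

    public static void main(String[] args) {
        long fromKeys = estimateFromKeyRange(0, 0.0);
        System.out.println(fromKeys);
        System.out.println(scanCardinality(fromKeys, 1_000_000L, 0.001, 0.5));
        System.out.println(scanCardinality(2_000L, 1_000_000L, 0.001, 0.5));
    }
}
```

With these made-up inputs, the fallback path scales the HMS row count by both selectivities, while the key range path scales the smaller key range row count by the non-key selectivity alone, which avoids the double-counting.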



> HBase scan cardinality division-by-zero leads to bogus cardinality
> ------------------------------------------------------------------
>
>                 Key: IMPALA-8058
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8058
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> A particular HBase query has highly selective key filters and runs into code bugs that produce a bogus, huge cardinality value.
> {{HBaseScanNode.computeStats()}} attempts to compute table cardinality by calling {{HBaseTable.getEstimatedRowStats()}}. This then calls into (in the latest versions) {{FeHBaseTable.getEstimatedRowStats()}}.
> This code tries to estimate cardinality by:
> * Scanning a set of regions.
> * Getting the size of each region.
> * Averaging a sample of rows to estimate row width.
> Once we know the size of the regions we need to scan, and the average row width, we can compute the scan cardinality.
> The problem in this particular query is that the predicates are so selective that no regions match. As a result, the average row width is zero. We divide (as a double) the region size by 0 and get INF. We cast that to a long and get Long.MAX_VALUE. We then use that as our (highly bogus) cardinality estimate.
> The code must:
> * Detect the division-by-zero (no sample rows) case.
> * Use an alternative estimate (such as multiplying the total table row count from HMS by the filter selectivity).
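
The arithmetic behind the bogus value can be reproduced in isolation. The snippet below is a minimal standalone sketch, not the Impala code; the class and method names are hypothetical, and the region size of 1024 bytes is a made-up input.

```java
// Standalone illustration of the failure mode; names are hypothetical.
public final class DivideByZeroDemo {
    /** Naive estimate: region size divided by average row width, cast to long. */
    public static long naiveEstimate(long regionBytes, double avgRowWidth) {
        // Floating-point division by zero does not throw in Java:
        // a positive numerator divided by 0.0 yields Double.POSITIVE_INFINITY.
        double rows = regionBytes / avgRowWidth;
        // Narrowing Infinity to long saturates at Long.MAX_VALUE.
        return (long) rows;
    }

    public static void main(String[] args) {
        long estimate = naiveEstimate(1024L, 0.0);
        System.out.println(estimate == Long.MAX_VALUE); // prints "true"
    }
}
```

Because no exception is thrown at any step, the saturated value flows on as if it were a legitimate cardinality, which is why the zero-width case has to be detected explicitly.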


