You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/09/24 17:00:00 UTC
[jira] [Updated] (IMPALA-7601) Define better default selectivity values

     [ https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated IMPALA-7601:
--------------------------------
    Summary: Define better default selectivity values  (was: Define a-priori selectivity and NDV values)

> Define better default selectivity values
> ----------------------------------------
>
>                 Key: IMPALA-7601
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7601
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example, the NDV (number of distinct values) is used to compute _selectivity_, the degree of reduction (also called the _reduction factor_) provided by a predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform distribution of values), that the above query will pick out one of these 100 values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity of the predicate {{t.a = 10}} is 0.01.
> For predicate operators other than equality ({{=}}), Impala uses 0.1 for the estimated selectivity. There are two observations. First, Impala does use a-priority selectivity estimates for these operators (NDV does not help with inequality, for example.) Second, the 0.01 is a bit too much of a one-size-fits-all estimate.
> h4. Selectivity Without Stats
> All this is good. But, what happens if statistics are not available for table {{t}}? How are we to know the selectivity of the predicate?
> It could be that {{t.a}} contains nothing but the value 10, so there is no reduction at all. I could be that {{t.a}} contains no values of 10, so the reduction is total, no rows are returned. The classic solution is to assume that the user put the predicate in the query for the purpose of subsetting the data. The classic value, shown in the [Ramakrishnan and Gehrke book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html] is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is seen in [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
>   // To be used where we cannot come up with a
>   // better estimate (selectivity_ is -1).
>   public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> As it turns out, however, the actual implementation is a bit more complex, as hinted at by the above comment. Impala relies on stats. Given stats, specifically the NDV, we compute selectivity as:
> {noformat}
> selectivity = 1 / ndv
> {noformat}
> What happens if there is no available NDV? In the case, we skip the selectivity calculation and leave it at a special default value of -1.0, which seems to indicate "unknown". See [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> Later, when we use the selectivity to calculate reduction factors, we simply skip any node with a selectivity of -1. You can see that in [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If stats are present, they are used. If stats are not present, Impala assumes that predicates have no effect.
> This ticket proposal a number of interrelated changes to add a-priori (before observation) defaults for selectivity and NDV based on classic DB practice.
> h4. Proposal: Add A-priori Selectivity Values
> But, we said earlier that users include a predicate because they expect it to do something. So, we are actually discarding the (albeit vague) information that the user provided.
> This is why many optimizers go ahead and assume a default 0.1 reduction factor for equality predicates even if no stats are available. The first proposal of this ticket is to use that default reduction factor even if no stats are present. This says that some reduction will occur, but, to be conservative, we assume not a huge reduction.
> h4. Proposal: Add Selectivity for All Predicate Operators
> As present, Impala computes reduction factors only for equality nodes. (See IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:
>  * {{!=}} - 0.1
>  * {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
>  * {{BETWEEN}} - 0.25
> Over in the Drill project, DRILL-5254 attempted to work out better estimates based on math and probability. However, the conclusion there was that, without NDV and histograms, there is more information in the user's intent than in the math. That is, if the user writes {{WHERE t.a != 10}}, there is a conditional probability that the user believes that this is a highly restrictive predicate, especially on big data. So, the reduction factor (which is a probability) is the same for {{=}} and {{!=}} in the absence of information. The same reasoning probably led to the rule-of-thumb values in the [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html] book.
> So, the second proposal is that Impala use the classic numbers for other operators when no stats are available.
> h4. Proposal: Use Stats-Based Selectivity Estimates When Available
> If stats are available, then we can "run the numbers" and get better estimates:
>  * {{p(a = x)}} = 1 / NDV
>  * {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV
> So, the third proposal is to use the above math for the {{!=}} case.
> h4. Proposal: Use Defaults when Stats are Unavailable or Not Useful
> As it turns out, without histograms, the NDV by itself gives us no information by which to compute reduction for an inequality. So the, classic a-priori defaults are the still best guess.
> This leads to the fourth proposal: that the classic defaults are used unless stats provides additional information to use instead. Said another way, retire the current selectivity = -1 ("undefined") concept: we will always have a selectivity, even if just a rule-of-thumb one. That is, some information is better than none.
> h4. Proposal: Define an A-priori NDV Estimate
> Impala uses NDV for a number of purposes. If no stats are available, Impala assumes no NDV. Given the assumed reduction factor, we can guess an NDV = 1/0.1 = 10. Depending on where NDV is used (need to research this), we might want to choose some other value, perhaps 100. So, the fifth proposal is to assume an a-priori (before observation) NDV value, with the actual default value TBD.
> h4. Proposal: Separate the NDV=0 and Unknown NDV cases
> As described in IMPALA-7310, Impala does not currently handle the case of a table with stats, but a column in the table contains only NULL values. The stats mechanism defines HDV as "number of distinct non-null values". So, a table full of nulls has NDV = 0. Unfortunately, if no stats are available at all, we also have NDV = 0.
> So, the sixth proposal is to separate these cases. If we can tell that a column is a nullable type, assume that the actual NDV is (stats NDV + 1). (This is, in fact the NDV calc used by Impala itself when computing NDV for expressions. See [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java].
> To avoid impact to existing tests, apply this rule only when it makes an impact, when the NDV value is, say, less than 10. (No reason to count the null value if NDV is large.) So, a column with all nulls will have an (adjusted) NDV = 1.
> Then, as noted above, if no stats exist, use the a-priori NDV.
> h4. References
>  * [Cost Estimation Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
>  * [Reduction factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html] from the classic Ramakrishnan and Gehrke book.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org