You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/04 20:53:00 UTC

[jira] [Created] (IMPALA-8048) Improve join cardinality estimation: urn model, NDV tracking, etc.

Paul Rogers created IMPALA-8048:
-----------------------------------

             Summary: Improve join cardinality estimation: urn model, NDV tracking, etc.
                 Key: IMPALA-8048
                 URL: https://issues.apache.org/jira/browse/IMPALA-8048
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers


Work is underway on a number of JIRA tickets to improve cardinality estimates. That work is constrained by the possible need to back-port to prior releases. As a result, the changes are made within the existing context to minimize the impact.

The current model makes a number of naive assumptions, however, that should be addressed in a second batch of changes which will entail a wider code impact.

h4. Adopt the Urn Model for NDV Estimation.

Suppose we have a table alumni(name, sex, class) with values such as:

{noformat}
John Smith, M, 2008
Jane Doe, F, 1993
...
{noformat}

We have 50 years of data, 1000 rows per year, or 50K rows. We have these stats:

{noformat}
|alumni| = 50K
|name| = 49K
|sex| = 2
|class| = 50
{noformat}

We have the following query which fills in the graduation date for each class:

{code:sql}
select * from alumni, grad_dates where sex='F' where alumni.class = grad_dates.class
{code}

Focusing just on the alumni table, how many classes will be available to match? That is, what is {{|class'|}}, the NDV of the name field after accounting for the affect of the predicate {{sex='F'}}.

Today we work it out with a linear model as follows:

{code}
sel(sex = 'F') = 1/|F| = 1/2 = 0.5
|sex'| = |sex| * sel(sex = 'F') = 2 * 0.5 = 1
|class'| = |class| * sel(sex = 'F') = 50 * 0.5 = 25
{code}

The math works for the {{sex}} field: the correct adjusted NDV is 1.

What about for {{class}}? Since the predicate eliminated half the rows, it eliminated half the class values. But, this can't be right. Surely women graduated in all classes. What went wrong?

The problem is the linear assumption. As shown in the [SwamiI and Schiefer|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf] paper, Section 5, the correct estimation technique is the urn model. See the paper for details. Using that model:

{noformat}
|x'| = (1 - (1 - 1/|x|)^|T')

|alumni'| = |alumni| * sel(sex = 'F') = 50K * .5 = 25K
|class'| = |class| * (1 - (1 - 1/50) ^ 25K) = 50
{noformat}

That is, as the cardinality of the selected table grows larger, the probability reaches 1 that other, non-correlated values will still appear. This, though we remove half the rows, all the classes are still represented.

h4. Per-Tuple Column NDV Tracking

At present, after the current round of changes, we use a linear model to estimate column NDV after filtering, and use the same model for all columns. If we adopt the urn model, then we must treat columns separately. In the above, we do *not* want to apply the urn model to the {{sex}} column. Why? We already know its cardinality from the filter predicate. Don't want to replace it with an estimated urn-model value. This problem is more acute if you consider a range predicate, such as those used on partitions: {{class > 2009}}.

To make the above work, we have to track NDV per column. That is, the scan node must provide a list of columns and their NDVs after scanning. Columns mentioned in a predicate have their NDVs estimated from selectivity. All other columns have their NDVs estimated from the urn model. (There are several ways to implement this; the point is that some columns must be singled out for special treatment.)

h4. Proper Join-to-Table Join Column NDVs

The NDV adjustment model says that, to compute the join cardinality, we need the adjusted column cardinality (NDV). When joining one table to another, it is clear how to adjust the column NDVs for each table: each is done according to the rules spelled out above.

A complexity arises, however, when we want to join three tables: we have ((A ⋈ B) ⋈ C). How do we adjust the NDVs for the columns created by the (A ⋈ B) join? If we simply adjust the NDV of table columns using a common selectivity (as done in the simple linear model), then we are correct for the columns from one table, but wrong for columns from the other. Why? The two table had different selectivities applied, we can't reduce them to a common number. 

The solution is the per-column adjusted NDV tracking: we'd know to apply one set of adjustments for columns from the left table, another for the right.

This requires additional data structures in each plan node.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org