Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/02 19:19:00 UTC

[jira] [Updated] (IMPALA-8015) Incorrect cardinality calculation for the generic case

     [ https://issues.apache.org/jira/browse/IMPALA-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated IMPALA-8015:
--------------------------------
    Description: 
TL;DR: the expression used to calculate the cardinality for an M:N ("generic") join is incorrect. Scroll to the end for the proposed fix. See IMPALA-8018 for the math details.

h4. Current Implementation

The code uses the following:

{code:java}
      double lhsAdjNdv = slots.lhsNdv();
      if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / slots.lhsNumRows();
      double rhsAdjNdv = slots.rhsNdv();
      if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / slots.rhsNumRows();
      long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, rhsAdjNdv))) *
          rhsCard);
{code}

As noted in IMPALA-8014, the above attempts to adjust NDVs for the correlated filter case.

Where:

* {{lhsCard}} is the output cardinality of the left (previous join) input, or {{|𝜎L|}}
* {{rhsCard}} is the output cardinality of the right (scanned) input, or {{|𝜎R|}}
* {{slots.lhsNdv()}} is the NDV of the left key, or {{|L.k1|}}
* {{slots.lhsNumRows()}} is the cardinality of the LHS table, or {{|L|}}
* {{slots.rhsNdv()}} is the NDV of the right key, or {{|R.k2|}}
* {{slots.rhsNumRows()}} is the cardinality of the RHS table, or {{|R|}}

Translated to our notation:

{noformat}
adjNdv(L.k1) = |L.k1| * ( |𝜎L| / |L| if |L| > |𝜎L|, 1 otherwise )
adjNdv(R.k2) = |R.k2| * ( |𝜎R| / |R| if |R| > |𝜎R|, 1 otherwise )
|L' ⋈ R'| = |L'| * |R'| / max(adjNdv(L.k1), adjNdv(R.k2))
{noformat}

We can make two simplifications:

* Since {{|𝜎X| <= |X|}}, and when {{|𝜎X| = |X|}} the ratio {{|𝜎X| / |X|}} is 1, we can drop the conditional and always apply the ratio.
* Since {{|k| <= |X|}} and {{|𝜎X|}} is positive, we can see that neither {{adjNdv}} term can ever be less than one, so we can ignore the corresponding {{max()}} in the code.

This gives:

{noformat}
adjNdv(L.k1) = |L.k1| * |𝜎L| / |L|
adjNdv(R.k2) = |R.k2| * |𝜎R| / |R|

                    |L'| * |R'|
|L' ⋈ R'| = -------------------------------
            max(adjNdv(L.k1), adjNdv(R.k2))
{noformat}

The meaning of the {{adjNdv}} values is that we reduce each NDV in proportion to the number of rows selected. If we take the adjusted NDVs as the correct values (but see below), the rest of the expression matches the derived expression.

h4. Code Bug

However, it *does not* make sense to adjust the NDVs in the M:N case, the very case to which this code applies.

For example, suppose we start with 100 key values, and we select half the rows. How many keys are left?

* If we start with 100 rows, then, yes, we do reduce the keys by half.
* But, if we start with 10,000 rows, removing half the rows still leaves 5,000 rows to hold our 100 keys.

The intuition is right: the number of matches should decrease as the scan filter becomes more selective. But that is exactly what the two scan terms ({{|𝜎L|}} and {{|𝜎R|}}) provide. The adjustment need not be applied again to the keys.

The result of this bug is that the code uses NDVs that are too small, producing estimates that are too large, which throws off join selection and can degrade query performance if it leads to an inefficient plan.

That this attempted adjustment does not, in fact, work is not surprising. As noted in S&S (section 3.1): "We do not know of any algorithm that *correctly* takes into account both local predicates and join predicates."

h4. Worked Example

Let's check with a detailed example:

* {{|L| = 10000}}
* {{|R| = 100}}
* {{|L'| = 3000}}
* {{|R'| = 50}}
* {{ndv(L.k1) = |L.k1| = 40}}
* {{ndv(R.k2) = |R.k2| = 20}}

Let's work this out intuitively first. The join is the normal cross join with adjustments:

{noformat}
|L' ⋈ R'| = |L| * |R| * adjustments
{noformat}

Adjustments:

* The left-side scan returns only 30% of the total rows.
* The left side has twice as many key values as the right, so half of the left rows find no match.
* Each L key with a match in R matches, on average, 100/20 = 5 R rows.
* But half of R is filtered out, halving the available matching rows.

And:

{noformat}
|L' ⋈ R'| = (|L| * 30% / 2) * (|R| / 20 / 2)
          = (10000 * 30% / 2) * (100 / 20 / 2)
          = 1500 * 2.5
          = 3750
{noformat}

Using Equation 5 from IMPALA-8018 (which is Equation 2 in the S&S paper):

{noformat}
                |L'| * |R'|
|L' ⋈ R'| = -------------------
            max(|L.k1|, |R.k2|)

          = 3000 * 50 / max(40, 20)
          = 3000 * 50 / 40
          = 3750
{noformat}

Which works out.

Now let's check the code's expression:

{noformat}
adjNdv(L.k1) = |L.k1| * |𝜎L| / |L|
             = 40 * 3000 / 10,000
             = 12
adjNdv(R.k2) = |R.k2| * |𝜎R| / |R|
             = 20 * 50 / 100
             = 10

                    |L'| * |R'|
|L' ⋈ R'| = -------------------------------
            max(adjNdv(L.k1), adjNdv(R.k2))

          = 3000 * 50 / max(12, 10)
          = 150,000 / 12
          = 12,500
{noformat}

Which does not work out, confirming the bug. As a result of the bug, we overestimate the cardinality by more than 3x (12,500 vs. 3,750).
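The two computations above can be checked mechanically. The sketch below (class and method names are hypothetical, not from the Impala source) evaluates both the derived formula and the code's adjusted-NDV formula on the worked-example numbers:

```java
public class GenericJoinCheck {
    // Derived estimate: |L'| * |R'| / max(|L.k1|, |R.k2|).
    static long derived(double lFilt, double rFilt, double lNdv, double rNdv) {
        return Math.round(lFilt * rFilt / Math.max(lNdv, rNdv));
    }

    // The code's estimate, with NDVs scaled by the selected-row fraction.
    static long adjusted(double lFilt, double rFilt, double lRows, double rRows,
                         double lNdv, double rNdv) {
        double lAdj = lRows > lFilt ? lNdv * lFilt / lRows : lNdv;
        double rAdj = rRows > rFilt ? rNdv * rFilt / rRows : rNdv;
        return Math.round(lFilt * rFilt / Math.max(1, Math.max(lAdj, rAdj)));
    }

    public static void main(String[] args) {
        // |L| = 10000, |R| = 100, |L'| = 3000, |R'| = 50, NDVs 40 and 20.
        System.out.println(derived(3000, 50, 40, 20));              // 3750
        System.out.println(adjusted(3000, 50, 10000, 100, 40, 20)); // 12500
    }
}
```

Running both on the same inputs makes the 3.3x overestimate directly visible.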

h4. Proposed Fix

See IMPALA-8018 for the proposed fix for this and IMPALA-8014.

  was:
IMPALA-8015

TL;DR: the expression used to calculate the cardinality for an M:N ("generic") join is incorrect. Scroll to the end for the proposed fix.

Please see the background information in IMPALA-8014, including notation. The explanation here builds on that material.

Impala uses two distinct ways to estimate join cardinality: FK/PK and the "generic" case. Both are in {{JoinNode.getJoinCardinality()}}.

The generic case handles the M:N join: multiple rows in the left table join with multiple rows in the right table. {{getGenericJoinCardinality()}} works out the estimated cardinality.

h4. Deriving the Cardinality

Assume a join of two tables, left ({{T1}}) and right ({{T2}}), with no predicate. We have a Cartesian product:

{noformat}
|T1 ⋈ T2| = |T1| * |T2|
{noformat}

Suppose we have an equi-join predicate: {{T1.k1 = T2.k2}}. We can now use a hash join in which T1 is on the left (probe) side and T2 is on the build (right) side. We can thus rename (alias) the tables so {{L = T1, R = T2}}.

Because we are concerned with the M:N (generic) case, we assume that {{|R.k2| < |R|}}. This means that multiple build rows have the same key, say {{R.k2 = x}}. This then implies that each row of the L (probe) side will potentially match multiple rows on the R (build) side. 

We want to know, how many rows will each probe-side row match?

Let’s focus on a table-to-table join and assume we can obtain the following from HMS:

* Probe and build table cardinalities: {{|L|}} and {{|R|}}
* Key cardinalities (NDVs): {{|L.k1|}} and {{|R.k2|}}

The probe side will match rows on the build side where {{R 𝜎 R.k2 = L.k1}}. Using the uniformity assumption from the S&S paper (see IMPALA-8014), we can see that the values of R.k2 divide the R table into a set of {{|R.k2|}} groups, the size of each of which must be:

{noformat}
|R 𝜎 R.k2 = x| = |R| / |R.k2|
{noformat}
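As a quick sanity check of the group-size formula, the snippet below (hypothetical code, not from Impala) builds a key column of 100 rows uniform over 20 values and measures the group sizes:

```java
import java.util.HashMap;
import java.util.Map;

public class GroupSizeCheck {
    // Count rows per key for a table whose keys are uniform over `ndv` values.
    static Map<Integer, Integer> groupSizes(int rows, int ndv) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < rows; i++) counts.merge(i % ndv, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // |R| = 100, |R.k2| = 20  =>  each group holds |R| / |R.k2| = 5 rows.
        Map<Integer, Integer> counts = groupSizes(100, 20);
        System.out.println(counts.size());   // 20 groups
        System.out.println(counts.get(0));   // 5 rows in each group
    }
}
```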

Let’s assume that every row on the L side matches some row on the R side. Then, the join cardinality is just:

{noformat}
|L ⋈ R| = |L| * |R| / |R.k2|
{noformat}

Both the L and R tables may be subject to selection during the scan (see IMPALA-8014), which, given the uniformity assumption, is an unbiased sampling of the rows of both tables. This reduces the rows available to join, but does not reduce the population of groups from which the sample is drawn. So:

{noformat}
|𝜎L ⋈ 𝜎R| = |𝜎L| * |𝜎R| / |R.k2|
{noformat}

Intuitively, however many rows are scanned, they are still divided into the same set of groups.

The above assumes that all rows from L match rows in R. (This is called the "containment assumption" in the S&S paper.) But Big Data is messy. Perhaps there are more key values in L than in R, or vice versa. We can make some reasonable assumptions:

* If there are fewer values in L.k1 than in R.k2, we can assume all probe rows will match a build key.
* If there are more values in L.k1 than in R.k2, we can assume we'll match all keys on the build side, then discard the extra probe values that don't match.

Again using the uniformity assumption, the probability of a match is simply the number of keys available for matching (on the right, or build, side) divided by the number of keys we want to match (on the left, or probe, side):

{noformat}
p(match) = /  |R.k2| / |L.k1| if |L.k1| > |R.k2|,
           \  1 otherwise
         = |R.k2| / max( |L.k1|, |R.k2| )
{noformat}

Let's check.

* If the NDVs are equal, the probability of a match is 1.
* If either table is empty, the probability is 0.
* If the probe-side NDV ({{|L.k1|}}) is half the build-side NDV ({{|R.k2|}}), then all probe rows will find a match, so the probability is 1 (though half of the build-side rows will go unmatched).
* If we have twice as many probe keys ({{|L.k1|}}) as build keys ({{|R.k2|}}), then half the probe rows won't find a match, and the probability of a match is 0.5.

All good.
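The checklist above can be exercised directly. The helper below is a hypothetical restatement of the {{p(match)}} expression (not planner code):

```java
public class MatchProbability {
    // p(match) = |R.k2| / max(|L.k1|, |R.k2|); 0 if either side has no keys.
    static double pMatch(long probeNdv, long buildNdv) {
        if (probeNdv == 0 || buildNdv == 0) return 0;
        return (double) buildNdv / Math.max(probeNdv, buildNdv);
    }

    public static void main(String[] args) {
        System.out.println(pMatch(20, 20)); // equal NDVs: 1.0
        System.out.println(pMatch(10, 20)); // probe has half the keys: 1.0
        System.out.println(pMatch(40, 20)); // probe has twice the keys: 0.5
    }
}
```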

Putting it all together:

{noformat}
|L' ⋈ R'| = |L'| * |R'| / |R.k2| * p(match)
          = (|L'| * |R'| / |R.k2|) * |R.k2| / max(|R.k2|, |L.k1|)
          = |L'| * |R'| / max(|R.k2|, |L.k1|)
{noformat}

Rearranging terms, we get the M:N cardinality estimation expression:

{noformat}
                |L'| * |R'|
|L' ⋈ R'| = -------------------            [Equation 1]
            max(|L.k1|, |R.k2|)
{noformat}

As it turns out, this is exactly Equation 2 in the S&S paper, which provides confirmation that the derivation is correct. 
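Equation 1 can also be checked empirically. The sketch below (hypothetical, not Impala code) builds two key columns with perfectly uniform keys and unbiased filters, counts the exact join cardinality, and compares it with the Equation 1 estimate:

```java
import java.util.HashMap;
import java.util.Map;

public class Equation1Check {
    // Rows i = 0..kept-1 with key i % ndv give a perfectly uniform key column;
    // keeping the first `kept` rows is an unbiased sample of the keys.
    static Map<Integer, Integer> keyCounts(int kept, int ndv) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < kept; i++) counts.merge(i % ndv, 1, Integer::sum);
        return counts;
    }

    // Exact equi-join cardinality: sum over keys of count(L') * count(R').
    static long exactJoin(Map<Integer, Integer> l, Map<Integer, Integer> r) {
        long total = 0;
        for (Map.Entry<Integer, Integer> e : l.entrySet())
            total += (long) e.getValue() * r.getOrDefault(e.getKey(), 0);
        return total;
    }

    public static void main(String[] args) {
        // |L| = 10000, NDV 40, |L'| = 3000; |R| = 100, NDV 20, |R'| = 50.
        long actual = exactJoin(keyCounts(3000, 40), keyCounts(50, 20));
        long estimate = Math.round(3000.0 * 50 / Math.max(40, 20)); // Equation 1
        System.out.println(actual + " vs " + estimate); // 3750 vs 3750
    }
}
```

With exactly uniform keys the exact count and the estimate agree; on skewed data they would diverge, as the uniformity assumption predicts.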

h4. Current Implementation

The code uses the following:

{code:java}
      double lhsAdjNdv = slots.lhsNdv();
      if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / slots.lhsNumRows();
      double rhsAdjNdv = slots.rhsNdv();
      if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / slots.rhsNumRows();
      long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, rhsAdjNdv))) *
          rhsCard);
{code}

Where:

* {{lhsCard}} is the output cardinality of the left (previous join) input, or {{|𝜎L|}}
* {{rhsCard}} is the output cardinality of the right (scanned) input, or {{|𝜎R|}}
* {{slots.lhsNdv()}} is the NDV of the left key, or {{|L.k1|}}
* {{slots.lhsNumRows()}} is the cardinality of the LHS table, or {{|L|}}
* {{slots.rhsNdv()}} is the NDV of the right key, or {{|R.k2|}}
* {{slots.rhsNumRows()}} is the cardinality of the RHS table, or {{|R|}}

Translated to our notation:

{noformat}
adjNdv(L.k1) = |L.k1| * ( |𝜎L| / |L| if |L| > |𝜎L|, 1 otherwise )
adjNdv(R.k2) = |R.k2| * ( |𝜎R| / |R| if |R| > |𝜎R|, 1 otherwise )
|L' ⋈ R'| = |L'| * |R'| / max(adjNdv(L.k1), adjNdv(R.k2))
{noformat}

We can make two simplifications:

* Since {{|𝜎X| <= |X|}}, and when {{|𝜎X| = |X|}} the ratio {{|𝜎X| / |X|}} is 1, we can drop the conditional and always apply the ratio.
* Since {{|k| <= |X|}} and {{|𝜎X|}} is positive, we can see that neither {{adjNdv}} term can ever be less than one, so we can ignore the corresponding {{max()}} in the code.

This gives:

{noformat}
adjNdv(L.k1) = |L.k1| * |𝜎L| / |L|
adjNdv(R.k2) = |R.k2| * |𝜎R| / |R|

                    |L'| * |R'|
|L' ⋈ R'| = -------------------------------
            max(adjNdv(L.k1), adjNdv(R.k2))
{noformat}

The meaning of the {{adjNdv}} values is that we reduce each NDV in proportion to the number of rows selected. If we take the adjusted NDVs as the correct values (but see below), the rest of the expression matches the derived expression.

h4. Code Bug

However, it *does not* make sense to adjust the NDVs in the M:N case, the very case to which this code applies.

For example, suppose we start with 100 key values, and we select half the rows. How many keys are left?

* If we start with 100 rows, then, yes, we do reduce the keys by half.
* But, if we start with 10,000 rows, removing half the rows still leaves 5,000 rows to hold our 100 keys.

The intuition is right: the number of matches should decrease as the scan filter becomes more selective. But that is exactly what the two scan terms ({{|𝜎L|}} and {{|𝜎R|}}) provide. The adjustment need not be applied again to the keys.

The result of this bug is that the code uses NDVs that are too small, producing estimates that are too large, which throws off join selection and can degrade query performance if it leads to an inefficient plan.

That this attempted adjustment does not, in fact, work is not surprising. As noted in S&S (section 3.1): "We do not know of any algorithm that *correctly* takes into account both local predicates and join predicates."

h4. Worked Example

Let's check with a detailed example:

* {{|L| = 10000}}
* {{|R| = 100}}
* {{|L'| = 3000}}
* {{|R'| = 50}}
* {{ndv(L.k1) = |L.k1| = 40}}
* {{ndv(R.k2) = |R.k2| = 20}}

Let's work this out intuitively first. The join is the normal cross join with adjustments:

{noformat}
|L' ⋈ R'| = |L| * |R| * adjustments
{noformat}

Adjustments:

* The left-side scan returns only 30% of the total rows.
* The left side has twice as many key values as the right, so half of the left rows find no match.
* Each L key with a match in R matches, on average, 100/20 = 5 R rows.
* But half of R is filtered out, halving the available matching rows.

And:

{noformat}
|L' ⋈ R'| = (|L| * 30% / 2) * (|R| / 20 / 2)
          = (10000 * 30% / 2) * (100 / 20 / 2)
          = 1500 * 2.5
          = 3750
{noformat}

Using Equation 1, derived above:

{noformat}
                |L'| * |R'|
|L' ⋈ R'| = -------------------
            max(|L.k1|, |R.k2|)

          = 3000 * 50 / max(40, 20)
          = 3000 * 50 / 40
          = 3750
{noformat}

Which works out.

Now let's check the code's expression:

{noformat}
adjNdv(L.k1) = |L.k1| * |𝜎L| / |L|
             = 40 * 3000 / 10,000
             = 12
adjNdv(R.k2) = |R.k2| * |𝜎R| / |R|
             = 20 * 50 / 100
             = 10

                    |L'| * |R'|
|L' ⋈ R'| = -------------------------------
            max(adjNdv(L.k1), adjNdv(R.k2))

          = 3000 * 50 / max(12, 10)
          = 150,000 / 12
          = 12,500
{noformat}

Which does not work out, confirming the bug. As a result of the bug, we overestimate the cardinality by more than 3x (12,500 vs. 3,750).

h4. Compound Keys

See IMPALA-8014 for a discussion of compound keys (when the join is on more than one column). We can use the same adjustment here as discussed in IMPALA-8014. The final M:N join cardinality estimation equation is:

{noformat}
                             |L'| * |R'|
|L' ⋈ R'| = ---------------------------------------------      [Equation 2]
            max(min(∏ |L.k1i|, |L|), min(∏ |R.k2i|, |R|))
{noformat}

The astute reader will have noticed that, except for names, Equation 2 is the same as the final M:1 equation (Equation 4) from IMPALA-8014. IMPALA-8018 recognizes this and suggests that the planner use one estimation algorithm, not two.
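A direct Java rendering of Equation 2 might look like the following (the class, method, and parameter names are illustrative, not from the Impala source):

```java
public class CompoundKeyEstimate {
    // Equation 2: |L'| * |R'| / max(min(prod lNdv_i, |L|), min(prod rNdv_i, |R|)).
    // The NDV product for a compound key is capped at the table's row count,
    // since a table cannot have more distinct key combinations than rows.
    static long estimate(double lFilt, double rFilt, double lRows, double rRows,
                         double[] lKeyNdvs, double[] rKeyNdvs) {
        double lNdv = 1, rNdv = 1;
        for (double ndv : lKeyNdvs) lNdv *= ndv;
        for (double ndv : rKeyNdvs) rNdv *= ndv;
        double divisor = Math.max(Math.min(lNdv, lRows), Math.min(rNdv, rRows));
        return Math.round(lFilt * rFilt / divisor);
    }

    public static void main(String[] args) {
        // Single-key case from the worked example: 3000 * 50 / max(40, 20).
        System.out.println(estimate(3000, 50, 10000, 100,
                new double[]{40}, new double[]{20}));           // 3750
        // Compound keys: NDV products 100*200 and 50*10, capped at |L| and |R|.
        System.out.println(estimate(3000, 50, 10000, 100,
                new double[]{100, 200}, new double[]{50, 10})); // 15
    }
}
```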

h4. Proposed Fix

The fix is simple: just use the correct expression, Equation 2 above.


> Incorrect cardinality calculation for the generic case
> ------------------------------------------------------
>
>                 Key: IMPALA-8015
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8015
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
