You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Fang-Yu Rao (Jira)" <ji...@apache.org> on 2020/01/23 23:07:00 UTC

[jira] [Resolved] (IMPALA-7608) Estimate row count from file size when no stats available

     [ https://issues.apache.org/jira/browse/IMPALA-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fang-Yu Rao resolved IMPALA-7608.
---------------------------------
    Fix Version/s: Impala 3.3.0
                   Impala 3.4.0
       Resolution: Resolved

> Estimate row count from file size when no stats available
> ---------------------------------------------------------
>
>                 Key: IMPALA-7608
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7608
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Assignee: Fang-Yu Rao
>            Priority: Major
>             Fix For: Impala 3.4.0, Impala 3.3.0
>
>
> Impala makes heavy use of stats, which is a good thing. Stats feed into query planning where they allow the planner to choose among a fixed set of alternatives such as: do I put t1 on the build or probe side of a join?
> Because the planner decisions tend to be discrete, we only need enough information to decide whether to do A or B (or, more generally, to choose among a set of choices A, B, C, ... N).
> Often data sizes are vastly different on different paths. Stats help refine these numbers, but much of the information just needs to be in the ball park: is table t1 larger or smaller than t2? Often, one table is much larger than the other, so even a rough size estimate will force the right decision (put the smaller table on the build side of a join.)
> Today, if Impala has no stats, it refuses to even consider table size. Consider the following unit test:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", -1);
> {noformat}
> This plans the given query, then verifies that the expected result cardinality is the number given. In this case, {{tinytable}} has no stats. So, we don't know the cardinality. OK...
> The table turns out to be 3 rows. Perhaps I join this to a hypothetical {{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala can't choose a good plan.
> The suggestion is to use table size to estimate row cardinality. Come up with some assumed row width, say 100. Then, estimate row count as {{file size / est. row width}}. This gives a ballpark number that would be plenty good for the planner to choose the proper plan much of the time. 
> Since this is such an easy estimate to make, and will address the occasional case in which stats are not available, it seems a shame to not take advantage of this information.
> In terms of implementation, {{HdfsScanNode.computeCardinalities()}} already uses some extrapolation, if enabled. It can be extended to do the last-ditch extrapolation suggested above if, after the current techniques, the cardinality is still undefined.
> If we apply this simple fix in a prototype build, the new test result is closer to reality:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", 1);
> {noformat}
> Given that the fix is so simple, any reason not to use the file size, when available? Is 100 a reasonable assumed row width? Should this functionality always be on, not just when enabled using the back-end config?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)