Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/06/21 04:36:00 UTC
[jira] [Commented] (IMPALA-7608) Estimate row count from file size when no stats available

    [ https://issues.apache.org/jira/browse/IMPALA-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869178#comment-16869178 ] 

ASF subversion and git services commented on IMPALA-7608:
---------------------------------------------------------

Commit b3b00da1a1c7b98e84debe11c10258c4a0dff944 in impala's branch refs/heads/master from Fang-Yu Rao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b3b00da ]

IMPALA-7608: Estimate row count from file size when no stats available

Added a feature that estimates the number of rows in the current
hdfs table when cardinality statistics for that table are not
available.

Also added an additional query option to revert the change in case of regression.

Testing:
(1) In CardinalityTest.java, replaced the original statement
"verifyCardinality("SELECT a FROM functional.tinytable", -1);" in
the method testBasicsWithoutStats() with
"verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality
of most PlanNode implementations. For each tested PlanNode, the behaviors
before and after we disable the feature are both tested.
(3) In set.test, modified three related test cases to make sure that
the added query option is included after executing "set all" in various
scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that would produce different
distributed query plans when this feature is enabled. Added an additional
JUnit test for 6 of those 8 affected JUnit tests when this feature is
enabled. Specifically, each tested query in the newly added test files involves
at least one hdfs table without available statistics.
We do not add test cases for 2 of the affected JUnit tests when this feature
is enabled since it results in flaky tests. These two JUnit tests are
testResourceRequirements() and testSpillableBufferSizing(). In this patch
we only test them when the feature is disabled.
(5) There are 5 Python end-to-end tests that consist of queries that would
produce different results. Added an additional query for each affected query
when this feature is disabled.

Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Reviewed-on: http://gerrit.cloudera.org:8080/12974
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Estimate row count from file size when no stats available
> ---------------------------------------------------------
>
>                 Key: IMPALA-7608
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7608
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Assignee: Fang-Yu Rao
>            Priority: Major
>
> Impala makes heavy use of stats, which is a good thing. Stats feed into query planning where they allow the planner to choose among a fixed set of alternatives such as: do I put t1 on the build or probe side of a join?
> Because the planner decisions tend to be discrete, we only need enough information to decide whether to do A or B (or, more generally, to choose among a set of choices A, B, C, ... N).
> Often data sizes are vastly different on different paths. Stats help refine these numbers, but much of the information just needs to be in the ballpark: is table t1 larger or smaller than t2? Often, one table is much larger than the other, so even a rough size estimate will force the right decision (put the smaller table on the build side of a join).
> Today, if Impala has no stats, it refuses to even consider table size. Consider the following unit test:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", -1);
> {noformat}
> This plans the given query, then verifies that the expected result cardinality is the number given. In this case, {{tinytable}} has no stats. So, we don't know the cardinality. OK...
> The table turns out to be 3 rows. Perhaps I join this to a hypothetical {{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala can't choose a good plan.
> The suggestion is to use table size to estimate row cardinality. Come up with some assumed row width, say 100. Then, estimate row count as {{file size / est. row width}}. This gives a ballpark number that would be plenty good for the planner to choose the proper plan much of the time. 
> Since this is such an easy estimate to make, and will address the occasional case in which stats are not available, it seems a shame to not take advantage of this information.
> In terms of implementation, {{HdfsScanNode.computeCardinalities()}} already uses some extrapolation, if enabled. It can be extended to do the last-ditch extrapolation suggested above if, after the current techniques, the cardinality is still undefined.
> If we apply this simple fix in a prototype build, the new test result is closer to reality:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", 1);
> {noformat}
> Given that the fix is so simple, any reason not to use the file size, when available? Is 100 a reasonable assumed row width? Should this functionality always be on, not just when enabled using the back-end config?
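
The fallback proposed in the issue description above (row count estimated as file size divided by an assumed row width, used only when no stats are available) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class name RowCountEstimate, the method estimate(), the estimationEnabled flag (standing in for the query option mentioned in the commit message), and the choice to round a small non-empty table up to 1 row are all assumptions made for the example. The width of 100 bytes is the value floated in the issue, not a tuned constant.

```java
// Hypothetical helper; none of these names come from the Impala code base.
final class RowCountEstimate {
    /** Sentinel for "cardinality unknown", mirroring the -1 used in the tests quoted above. */
    static final long UNKNOWN = -1;

    /** Assumed average row width in bytes (the value suggested in the issue). */
    static final long ASSUMED_ROW_WIDTH_BYTES = 100;

    private RowCountEstimate() {}

    /**
     * Returns the row count to plan with.
     *
     * @param knownRowCount     row count from stats, or a negative value if no stats exist
     * @param totalFileBytes    total size of the table's files, or a negative value if unknown
     * @param estimationEnabled stand-in for the option that turns the fallback on or off
     */
    static long estimate(long knownRowCount, long totalFileBytes, boolean estimationEnabled) {
        // Real stats always win over the file-size guess.
        if (knownRowCount >= 0) {
            return knownRowCount;
        }
        // With the fallback disabled, or no file size to work from, stay "unknown".
        if (!estimationEnabled || totalFileBytes < 0) {
            return UNKNOWN;
        }
        // An empty table has no rows.
        if (totalFileBytes == 0) {
            return 0;
        }
        // Assumption: a non-empty table holds at least one row, so round up to 1.
        return Math.max(1, totalFileBytes / ASSUMED_ROW_WIDTH_BYTES);
    }
}
```

Under these assumptions, a table with no stats and 1,000,000 bytes of data is planned as 10,000 rows, while the same table with the fallback disabled keeps the unknown cardinality of -1. The estimate only needs to be in the right ballpark to separate a tiny table from a huge one when choosing the build side of a join.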



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
