Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/09/19 01:35:00 UTC
[jira] [Comment Edited] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619827#comment-16619827 ]
Paul Rogers edited comment on IMPALA-7560 at 9/19/18 1:34 AM:
--------------------------------------------------------------
FWIW, it turns out that Apache Drill did a similar analysis to work out rules based on the classic defaults plus some reasoning about probability: DRILL-5254
For Drill, since only the "classic" estimates (not stats) are available, the probabilities don't work out because of the conditional probability implied when a user selects one operator vs. another. However, the same mathematical reasoning could be applied to this ticket if we do have stats to work with.
was (Author: paul.rogers):
Turns out that Apache Drill did a similar analysis to work out rules based on the classic defaults plus some reasoning about probability: DRILL-5254
For Drill, since only the "classic" estimates (not stats) are available, the probabilities don't work out because of the conditional probability of a user using one operator vs. another. But, the math reasoning might be used here if we do have stats to work with.
> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
> Key: IMPALA-7560
> URL: https://issues.apache.org/jira/browse/IMPALA-7560
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
> Reporter: bharath v
> Priority: Major
>
> Currently, we use the default selectivity estimate for any binary predicate whose operator is not EQ or NOT_DISTINCT.
> {noformat}
> // Determine selectivity
> // TODO: Compute selectivity for nested predicates.
> // TODO: Improve estimation using histograms.
> Reference<SlotRef> slotRefRef = new Reference<SlotRef>();
> if ((op_ == Operator.EQ || op_ == Operator.NOT_DISTINCT)
>     && isSingleColumnPredicate(slotRefRef, null)) {
>   long distinctValues = slotRefRef.getRef().getNumDistinctValues();
>   if (distinctValues > 0) {
>     selectivity_ = 1.0 / distinctValues;
>     selectivity_ = Math.max(0, Math.min(1, selectivity_));
>   }
> }
> {noformat}
> This can give very conservative estimates. For example:
> {noformat}
> [localhost:21000] tpch> select * from nation where n_regionkey != 1;
> [localhost:21000] tpch> summary;
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail      |
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> | 00:SCAN HDFS | 1      | 3.32ms   | 3.32ms   | 20    | 3          | 143.00 KB | 16.00 MB      | tpch.nation |
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> [localhost:21000] tpch>
> {noformat}
> Ideally, we could have inverted the selectivity to 4/5 (= 1 - 1/5), which would give a better estimate.
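
The inversion suggested in the description can be sketched as follows. This is a minimal, standalone illustration and not Impala's actual code: the class and method names are hypothetical, the default selectivity of 0.1 and the round-to-nearest row estimate are assumptions chosen to be consistent with the summary output above, and the 25-row table size is the standard TPC-H nation table.

```java
public class NotEqualsSelectivity {
  // Assumed fallback selectivity when no better estimate is available.
  static final double DEFAULT_SELECTIVITY = 0.1;

  /** Selectivity of "col = const": 1/NDV clamped to [0, 1], or -1 if NDV is unknown. */
  static double eqSelectivity(long distinctValues) {
    if (distinctValues <= 0) return -1;
    return Math.max(0.0, Math.min(1.0, 1.0 / distinctValues));
  }

  /** Selectivity of "col != const": the complement of the equality estimate. */
  static double neSelectivity(long distinctValues) {
    double eq = eqSelectivity(distinctValues);
    return eq < 0 ? -1 : 1.0 - eq;
  }

  /** Estimated output rows; rounding to nearest is an assumption of this sketch. */
  static long estimateRows(long tableRows, double selectivity) {
    double s = selectivity < 0 ? DEFAULT_SELECTIVITY : selectivity;
    return Math.round(tableRows * s);
  }

  public static void main(String[] args) {
    long tableRows = 25;      // rows in tpch.nation
    long distinctValues = 5;  // distinct n_regionkey values
    System.out.println("default estimate:  "
        + estimateRows(tableRows, DEFAULT_SELECTIVITY));
    System.out.println("inverted estimate: "
        + estimateRows(tableRows, neSelectivity(distinctValues)));
  }
}
```

Under these assumptions the default path yields an estimate of 3 rows, while the inverted selectivity yields 20, which matches the actual row count in the summary. The sketch ignores NULLs and skew; both would need handling in a real planner.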