You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by LakeShen <sh...@gmail.com> on 2023/10/01 08:30:27 UTC

Re: Existing rules for duplication of filter conditions above joins which should not be pushed down by FilterIntoJoinRule ?

In the Trino,there are two kinds of predicates if the predicates could be
pushed: 1. unenforced predicate 2. enforced predicate unenforced predicate
means that a predicate could be pushed down, but the bottom layer will not
guarantee that the data completely conforms to the predicate, so we still
need to retain this predicate in the upper layer. The bottom layer could
use the push-down predicate to do some advanced data filtering, such as
using the Parquet files row-group metadata information to filter data. The
enforced predicate means that the underlying layer could guarantee the
data, so the predicate could be completely pushed down directly, such as
the predicate of partitioning.Enforced predicate pushdown must be a better
optimization method. For unenforced predicate, since there will be a
repeated predicate in the plan, whether it needs to be pushed down should
be determined in conjunction with CBO framework, because we do not know
whether it will be the optimal plan after pushing down.As Julian
said,maybe predicate
is also expensive and not very selective.I think TPC-H Q19 is an example of
unenforced predicate.
So you should combine it with the CBO framework to determine whether a
predicate needs to be repeated.

Best,
LakeShen

Julian Hyde <jh...@gmail.com> 于2023年10月1日周日 05:54写道:

> Thanks, Ian. That’s useful.
>
> My concern with duplication of filter conditions is that we give the lower
> relational expression permission to ignore them. Suppose we have a query
> (Q1):
>
>   select *
>   from emp
>     left join dept
>     using (deptno)
>   where myPredicate(dept.dname)
>
> We could duplicate the filter, like this (Q2):
>
>   select *
>   from emp
>     left join (select * from dept where myPredicate(dept.dname) as dept
>     using (deptno)
>   where myPredicate(dept.dname)
>
> As in your case, we have to keep the predicate because it is not strong —
> it does not necessarily return unknown if its argument is null.
>
> But suppose the predicate is also expensive and not very selective. The
> subquery needs to know that it can safely ignore the filter.
>
> Is it sufficient to just mark Q1 and Q2 as equivalent, and let cost-based
> optimization do its thing? Possibly, but I’m not totally sure. Maybe
> there’s a way to tag a filter as optional in metadata.
>
> By the way, this kind of predicate comes up as a semi-join condition. If
> we have a 3-way join
>
>   (R left join S) left join T
>
> we may want to convert it to
>
>   ((R where exists (select from T where …)
>     left join S)
>   left join T
>
> The ‘exists’ semi join is probably expensive - it performs a join, after
> all - but if it is highly selective, and if a cheap approximation can be
> found (say a bloom filter) it is worth doing.
>
> Julian
>
>
>
> > On Sep 27, 2023, at 1:53 PM, Ian Bertolacci <ia...@workday.com.INVALID>
> wrote:
> >
> > Hi Stamatis,
> > Thanks for pointing out those rules.
> >
> > Unfortunately, I think my initial premise was wrong.
> > Thanks!
> > -Ian
> >
> > (elaborating mostly for future readers)
> > For example, say I have these two tables:
> >
> > T1
> > | Id | ForeignKey
> > | 1  | 10
> > | 2  | 20
> > | 3  | 30
> > | 4  | 40
> >
> > T2
> > | Id | SomeValue
> > | 10 | 123
> > | 20 | 456
> > | 30 | null
> >
> > Doing a simple left join
> >
> > select * from T1 left join T2 on T1.ForeignKey = T2.Id
> >
> > | T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
> > | 1     | 10            | 10    | 123
> > | 2     | 20            | 20    | 456
> > | 3     | 30            | 30    | null
> > | 4     | 40            | null  | null
> >
> > Now applying a filter *above* the join which permits nulls
> >
> > select * from T1 left join T2 on T1.ForeignKey = T2.Id where
> T2.SomeValue = 123 or T2.SomeValue is null
> >
> > | T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
> > | 1     | 10            | 10    | 123
> > | 3     | 30            | 30    | null
> > | 4     | 40            | null  | null
> > Tuple for T1.id = 2 has has been filtered out because T2.SomeValue = 456
> is neither 123 nor null
> >
> > Now let’s duplicate the filter condition above and below the join
> > select * from T1 left join (select * from T2 where T2.SomeValue = 123 or
> T2.SomeValue is null) as T2 on T1.ForeignKey = T2.Id where T2.SomeValue =
> 123 or T2.SomeValue is null
> >
> > | T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
> > | 1     | 10            | 10    | 123
> > | 2     | 20            | null  | null
> > | 3     | 30            | 30    | null
> > | 4     | 40            | null  | null
> >
> > Tuple for T2.id = 20 was filtered out under the join, which creates null
> values for the match of T1 from the left join, but the row remains because
> the filter condition permits null values.
> >
> >
> >
> > On 2023/09/27 08:38:50 Stamatis Zampetakis wrote:
> >> Hey Ian,
> >>
> >> I don't think there is such a rule in Calcite but you may find similar
> >> ideas in rules present in other projects.
> >>
> >> In Hive for instance, there is the HivePreFilteringRule [1, 2] that
> >> pushes "redundantly" some filters below other operations in the tree.
> >> What is inherently problematic with all these rules that introduce
> >> "duplicate" predicates in the plan is that they require some kind of
> >> state to prevent infinite matching. Additionally, you need to pay
> >> attention when you put together rules that are pushing and pulling
> >> since they can interact badly with each other.
> >>
> >> Best,
> >> Stamatis
> >>
> >> [1]
> https://github.com/apache/hive/blob/57f096c9a73eb92806f2a7cc97f87fabf5d546fe/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePreFilteringRule.java
> >> [2]
> https://issues.apache.org/jira/browse/HIVE-9069?focusedCommentId=14534098&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14534098
> >>
> >> On Tue, Sep 26, 2023 at 10:10 PM Ian Bertolacci
> >> <ia...@workday.com.inva>lid> wrote:
> >>>
> >>> Hi,
> >>> I was wondering if there exist any rules to duplicate filters which
> exist above the join, whose effect is dependent on the result of the join
> and therefore cannot be *pushed* below a join, but could be *duplicated*
> below the join.
> >>>
> >>> For example: `select … from A LEFT join B on … where B.field is null`
> >>> Here, the best we could do is push the filter condition into the join
> condition, but not necessarily below it, because the null-ness of the
> column is partially dependent on the result of the join.
> >>> However, in this case we can duplicate the condition below the join:
> >>> `select … from A LEFT join (select … from B where B.field is null) as
> B on … where B.field is null`
> >>> This is because the condition because the null-ness of the column is
> also partially dependent value of the column.
> >>> With both of these filters in place we capture instances of B which
> are null because the column is null and because there was no match to B
> >>> This (1) reduced the cardinality of that side of the join, and (2)
> maintained the original intent of the query.
> >>>
> >>> In this example, I use `is null` but we would like to do this for some
> of our custom comparison operators.
> >>> For these operators, we cannot do push-down (because it would change
> the intent of the original query) but doing filter duplication should be
> fine (though we’re still making sure of that).
> >>>
> >>> I figure that this probably doesn’t exist, in which case I’ll probably
> use FilterIntoJoinRule as a jumping off point.
> >>> Any other suggestions?
> >>>
> >>> Thanks!
> >>> -Ian J. Bertolacci
> >>
>
>