You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/06 07:52:55 UTC

[GitHub] [spark] gengliangwang opened a new pull request #28741: [SPARK-31919][SQL] Push down more predicates through Join

gengliangwang opened a new pull request #28741:
URL: https://github.com/apache/spark/pull/28741

### What changes were proposed in this pull request?

Currently, in the rule`PushPredicateThroughJoin`, if the condition predicate of `Or` operator can't be entirely pushed down, it will be thrown away.
In fact, the predicates under `Or` operators can be partially pushed down.
For example, says `a` and `b` are able to be pushed into one of the joined tables, while `c` can't be pushed down, the predicate
`a or (b and c)`
can be converted as
`(a or b) and (a or c)`
We can still push down `(a or b)`.
We can't push down disjunctive predicates only when one of its children is not partially convertible.

The common way is to convert the condition into conjunctive normal form(CNF), so that we can find all the predicates that can be pushed down by going over the CNF predicate.
However, CNF conversion result can be huge, the recursive implementation can cause stack overflow on complex predicates. There were PRs for it such as #10444, #15558, #28575.
There is also non-recursive implementation: #28733 . It should be workable but this PR proposes a simpler implementation.

Essentially, we just need to traverse predicate and extract the convertible sub-predicates like what we did in https://github.com/apache/spark/pull/24598 . There is no need to maintain the CNF result set.

### Why are the changes needed?

Improve query performance. PostgreSQL, Impala and Hive support similiar feature.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and benchmark test.

SQL | Before this PR | After this PR
--- | --- | ---
TPCDS 5T Q13 | 84s | 21s
TPCDS 5T q85 | 66s | 34s
TPCH 1T q19 | 37s | 32s

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org