You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Varun Rao <va...@cloudera.com> on 2019/04/28 18:36:36 UTC

Predicate Push Down Vs On Clause

When performing a join in Hive and then filtering the output with a where
clause, the Hive compiler will try to filter data before the tables are
joined. This is known as predicate pushdown (
http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)

For example:

SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6

Rows from table a which have some_name = 6 will be filtered before
performing the join, if push down predicates are enabled(hive.optimize.ppd).

However, I have also learned recently that there is another way of
filtering data from a table before joining it with another table(
https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/
).

One can provide the condition in the ON clause, and table a will be
filtered before the join is performed

For example:

SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6

Are these both valid ways of filtering data before joins?

Thank you

Yours Truly,
Varun Rao

Re: Predicate Push Down Vs On Clause

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Yes both of these are valid ways of filtering data before join in Hive.

This has several implementation specifics attached to it. If you're looking at Hive 1.1 or before, it might not work the same way as Vineet mentioned.

In older versions Calcite rewrites aren't triggered, which prevented some of the PPD rewrites.

This became massively better after the pre-join transforms in Calcite were made as a default, even without statistics.

I think this actually had a number of fixes in Hive 1.2.1, which also did column pruning through the join (i.e the condition column is no longer part of the join values).

Hive3+ the column pruning is replaced by substitution, so you won't send the whole column through, but evaluate it to a boolean before joining (see TPC-DS query2).

> As long as the join is not outer and the ON condition is not on non-null generating side of join Hive planner will try to push the predicate down to table scan.

Since I learned it the hard  way a few years back, here's an example from tpc-ds where the where clause does something different from the on clause.

https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query93.q

if the reason condition was an on-clause, then the join would produce different results. Because it is in the where clause, the 'Did not like the warranty' condition changes the join type on optimization.

Cheers,
Gopal



Re: Predicate Push Down Vs On Clause

Posted by Vineet Garg <vg...@apache.org>.
Hi Varun,

Yes both of these are valid ways of filtering data before join in Hive.

As long as the join is not outer and the ON condition is not on non-null
generating side of join Hive planner will try to push the predicate down to
table scan.
In fact Hive goes one step ahead and also generate IS NOT NULL predicate on
join keys (e.g. a.some_id IS NOT NULL, b.some_other_id IS NOT NULL) and
push is down to table scan if possible.

Regards,
Vineet Garg


On Sun, Apr 28, 2019 at 11:54 AM Varun Rao <va...@cloudera.com> wrote:

> When performing a join in Hive and then filtering the output with a where
> clause, the Hive compiler will try to filter data before the tables are
> joined. This is known as predicate pushdown (
> http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
>
> For example:
>
> SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
>
> Rows from table a which have some_name = 6 will be filtered before
> performing the join, if push down predicates are enabled(hive.optimize.ppd).
>
> However, I have also learned recently that there is another way of
> filtering data from a table before joining it with another table(
> https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/
> ).
>
> One can provide the condition in the ON clause, and table a will be
> filtered before the join is performed
>
> For example:
>
> SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6
>
> Are these both valid ways of filtering data before joins?
>
> Thank you
>
> Yours Truly,
> Varun Rao
>