You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Rick Kramer (JIRA)" <ji...@apache.org> on 2018/01/09 21:00:04 UTC

[jira] [Created] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

Rick Kramer created SPARK-23012:
-----------------------------------

             Summary: Support for predicate pushdown and partition pruning when left joining large Hive tables
                 Key: SPARK-23012
                 URL: https://issues.apache.org/jira/browse/SPARK-23012
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 2.2.0
            Reporter: Rick Kramer


We have a hive view which left outer joins several large, partitioned orc hive tables together on date. When the view is used in a hive query, hive pushes date predicates down into the joins and prunes the partitions for all tables. When I use this view from pyspark, the predicate is only used to prune the left-most table and all partitions from the additional tables are selected.

For example, consider two partitioned hive tables a & b joined in a view:

create table a (
   a_val string
)
partitioned by (ds string)
stored as orc;

create table b (
   b_val string
)
partitioned by (ds string)
stored as orc;

create view example_view as
select
    a_val
    , b_val
    , ds
from a 
left outer join b on b.ds = a.ds

Then in pyspark you might try to query from the view filtering on ds:

spark.table('example_view').filter(F.col('ds') == '2018-01-01')

If table a and b are large, this results in a plan that filters a on ds = 2018-01-01, but selects scans all partitions of table b.

If the join in the view is changed to an inner join, the predicate gets pushed down to a & b and the partitions are pruned as you'd expect.

In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org