Posted to issues@spark.apache.org by "Rick Kramer (JIRA)" <ji...@apache.org> on 2018/01/09 21:00:04 UTC
[jira] [Created] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
Rick Kramer created SPARK-23012:
-----------------------------------
Summary: Support for predicate pushdown and partition pruning when left joining large Hive tables
Key: SPARK-23012
URL: https://issues.apache.org/jira/browse/SPARK-23012
Project: Spark
Issue Type: Improvement
Components: Optimizer
Affects Versions: 2.2.0
Reporter: Rick Kramer
We have a Hive view which left outer joins several large, partitioned ORC Hive tables together on date. When the view is used in a Hive query, Hive pushes the date predicates down into the joins and prunes the partitions of all tables. When the same view is used from pyspark, the predicate only prunes the left-most table; every partition of the additional tables is scanned.
For example, consider two partitioned Hive tables a and b joined in a view:
create table a (
  a_val string
)
partitioned by (ds string)
stored as orc;

create table b (
  b_val string
)
partitioned by (ds string)
stored as orc;

create view example_view as
select
  a.a_val
  , b.b_val
  , a.ds
from a
left outer join b on b.ds = a.ds;
Then in pyspark you might query the view, filtering on ds:

from pyspark.sql import functions as F

spark.table('example_view').filter(F.col('ds') == '2018-01-01')
If tables a and b are large, this results in a plan that filters a on ds = '2018-01-01' but scans all partitions of table b.
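A quick way to see this is to inspect the query plan; a minimal sketch (the exact PartitionFilters output in the physical plan varies by Spark version):

from pyspark.sql import functions as F

df = spark.table('example_view').filter(F.col('ds') == '2018-01-01')
# The physical plan shows the scan of table a carrying the ds predicate,
# while the scan of table b lists no partition filters.
df.explain(True)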
If the join in the view is changed to an inner join, the predicate gets pushed down to both a and b and the partitions are pruned as you'd expect. Pruning b should be safe for the left outer join as well: once a is filtered, rows of b whose ds does not match the filter can never satisfy the join condition b.ds = a.ds, so scanning them cannot change the result.
In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.
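Until the optimizer handles this, one workaround is to filter each table before joining in pyspark; a minimal sketch using the example tables above (this re-declares the view's join logic, which is exactly what we would prefer to avoid for the real, more complex view):

from pyspark.sql import functions as F

ds = '2018-01-01'
# Filtering each side before the join lets Spark prune partitions on
# both scans instead of only the left-most one.
a = spark.table('a').filter(F.col('ds') == ds)
b = spark.table('b').filter(F.col('ds') == ds)
result = a.join(b, on='ds', how='left_outer').select('a_val', 'b_val', 'ds')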