Posted to issues@spark.apache.org by "Sunil Srivatsa (JIRA)" <ji...@apache.org> on 2016/10/21 22:53:59 UTC
[jira] [Updated] (SPARK-18059) Spark not respecting partitions in partitioned Hive views
[ https://issues.apache.org/jira/browse/SPARK-18059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sunil Srivatsa updated SPARK-18059:
-----------------------------------
Description:
For Hive partitioned views (https://cwiki.apache.org/confluence/display/Hive/PartitionedViews), when a query specifies a partition, Spark reads from all partitions rather than only the specified ones. For example:
CREATE TABLE srcpart_1 (key INT, value STRING) PARTITIONED BY (ds STRING);
CREATE TABLE srcpart_2 (key INT, value STRING) PARTITIONED BY (ds STRING);
CREATE VIEW vp2
PARTITIONED ON (ds)
AS
SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds
FROM srcpart_1
JOIN srcpart_2
ON srcpart_1.key = srcpart_2.key
AND srcpart_1.ds = srcpart_2.ds;
ALTER VIEW vp2 ADD PARTITION (ds='2016-01-01');
When a query executes, e.g.:
SELECT key, value FROM vp2 WHERE ds = '2016-01-01'
Spark reads the Parquet files for all partitions, which is easy to observe in a job by printing df.inputFiles. The Hive EXPLAIN output for that query against the view is the same as for the equivalent plain join, which suggests Hive itself prunes partitions correctly:
SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds FROM srcpart_1
JOIN srcpart_2 ON srcpart_1.key = srcpart_2.key WHERE srcpart_1.ds = '2016-01-01' AND srcpart_2.ds = '2016-01-01'
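The mismatch is easiest to state in terms of df.inputFiles: with working partition pruning, the query above should only touch files under the ds=2016-01-01 partition directory. The following is a minimal pure-Python sketch of that expected pruning behavior (no Spark required); the directory layout and the helper function are hypothetical illustrations, not Spark internals.

```python
# Hypothetical on-disk layout for srcpart_1; paths are illustrative only.
partitions = {
    "2016-01-01": ["/warehouse/srcpart_1/ds=2016-01-01/part-00000.parquet"],
    "2016-01-02": ["/warehouse/srcpart_1/ds=2016-01-02/part-00000.parquet"],
}

def pruned_input_files(partitions, ds):
    """Files a pruning-aware reader should scan for WHERE ds = <ds>."""
    return partitions.get(ds, [])

# Expected behavior: only the 2016-01-01 partition's files are scanned.
print(pruned_input_files(partitions, "2016-01-01"))
```

The reported bug is that df.inputFiles for the view query instead contains the files of every partition, i.e. the union of all values in the layout above.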
was:
For Hive partitioned views (https://cwiki.apache.org/confluence/display/Hive/PartitionedViews), when a query specifies a partition, Spark reads from all partitions rather than only the specified ones. For example:
CREATE TABLE srcpart_1 (key INT, value STRING) PARTITIONED BY (ds STRING);
CREATE TABLE srcpart_2 (key INT, value STRING) PARTITIONED BY (ds STRING);
CREATE VIEW vp2
PARTITIONED ON (ds)
AS
SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds
FROM srcpart_1
JOIN srcpart_2
ON srcpart_1.key = srcpart_2.key
AND srcpart_1.ds = srcpart_2.ds;
ALTER VIEW vp2 ADD PARTITION (ds='2016-10-01');
When a query executes, e.g.:
SELECT key, value FROM vp2 WHERE ds = '2016-01-01'
Spark reads the Parquet files for all partitions, which is easy to observe in a job by printing df.inputFiles. The Hive EXPLAIN output for that query against the view is the same as for the equivalent plain join, which suggests Hive itself prunes partitions correctly:
SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds FROM srcpart_1
JOIN srcpart_2 ON srcpart_1.key = srcpart_2.key WHERE srcpart_1.ds = '2016-01-01' AND srcpart_2.ds = '2016-01-01'
> Spark not respecting partitions in partitioned Hive views
> ---------------------------------------------------------
>
> Key: SPARK-18059
> URL: https://issues.apache.org/jira/browse/SPARK-18059
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.5.1
> Environment: ResourceManager version: 2.6.0-cdh5.4.2 from 15b703c8725733b7b2813d2325659eb7d57e7a3f by jenkins source checksum e7a085479aa1989b5cecfabea403549 on 2015-05-20T00:05Z
> Hadoop version: 2.6.0-cdh5.4.2 from 15b703c8725733b7b2813d2325659eb7d57e7a3f by jenkins source checksum de74f1adb3744f8ee85d9a5b98f90d on 2015-05-19T23:58Z
> Reporter: Sunil Srivatsa
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org