You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/07/23 01:15:20 UTC

[jira] [Commented] (SPARK-16689) FileSourceStrategy: Pruning Partition Columns When No Partition Column Exist in Project

    [ https://issues.apache.org/jira/browse/SPARK-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390449#comment-15390449 ] 

Apache Spark commented on SPARK-16689:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14322

> FileSourceStrategy: Pruning Partition Columns When No Partition Column Exist in Project
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-16689
>                 URL: https://issues.apache.org/jira/browse/SPARK-16689
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> For partitioned file sources, the current implementation always scans all the partition columns. However, this is not necessary when the projected column list does not include any partition column. In addition, we also can avoid the unnecessary Project.
> Below is an example,
> {noformat}
> spark
>   .range(N)
>   .selectExpr("id AS value1", "id AS value2", "id AS p1", "id AS p2", "id AS p3")
>   .toDF("value", "value2", "p1", "p2", "p3").write.format("json")
>   .partitionBy("p1", "p2", "p3").save(tempDir)
> spark.read.format("json").load(tempDir).selectExpr("value")
> {noformat}
> Before the PR changes, the physical plan is like:
> {noformat}
> == Physical Plan ==
> *Project [value#37L]
> +- *Scan json [value#37L,p1#39,p2#40,p3#41] Format: JSON, InputPaths: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-f7a4294a-2e1b-4f44-9ebb-1a5eb..., PushedFilters: [], ReadSchema: struct<value:bigint>
> {noformat}
> After the PR changes, the physical plan becomes:
> {noformat}
> == Physical Plan ==
> *Scan json [value#147L] Format: JSON, InputPaths: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a5bcb14a-46c2-4c20-8f34-9662b..., PushedFilters: [], ReadSchema: struct<value:bigint>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org