
Posted to issues@spark.apache.org by "Simeon Simeonov (JIRA)" <ji...@apache.org> on 2016/01/19 03:07:39 UTC

[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

    [ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106124#comment-15106124 ] 

Simeon Simeonov edited comment on SPARK-12890 at 1/19/16 2:07 AM:
------------------------------------------------------------------

I've experienced this issue with a multi-level partitioned table loaded via {{sqlContext.read.parquet()}}. I'm not sure whether Spark actually reads any data from the Parquet files, but it does touch every Parquet file (perhaps to read metadata?). I discovered this by accident because the table tree contained invalid Parquet files left over from a failed job. Spark errored, which surprised me: I would have expected it not to look at any of the data when the query could be satisfied entirely from the partition columns.

This is an important issue because it affects query speed for very large partitioned tables.
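
To illustrate the behavior I expected, here is a minimal sketch in plain Python. It is not Spark's implementation and uses no Spark API; it only shows that, for a Hive-style {{column=value}} directory layout, a query such as max(date) can be answered from directory names without opening any file. The helper names ({{partition_values}}, {{max_partition_value}}, {{build_demo_table}}), the {{hour}} partition level, and the file names are hypothetical, chosen for the example.

```python
import os
import tempfile


def partition_values(root, column):
    """Collect values of `column` from Hive-style `column=value` directory names.

    Only directory names are inspected; no file under `root` is opened.
    """
    prefix = column + "="
    values = set()
    for _dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if name.startswith(prefix):
                values.add(name[len(prefix):])
    return values


def max_partition_value(root, column):
    """Return the largest value of `column` (compared as strings), or None."""
    values = partition_values(root, column)
    return max(values) if values else None


def build_demo_table(root):
    """Create a two-level layout (date=/hour=) whose data files are all invalid."""
    for date in ("2016-01-17", "2016-01-18"):
        for hour in ("00", "01"):
            part_dir = os.path.join(root, "date=" + date, "hour=" + hour)
            os.makedirs(part_dir)
            # Deliberately not a real Parquet file: it must never be read.
            with open(os.path.join(part_dir, "part-00000.parquet"), "wb") as f:
                f.write(b"not a valid parquet file")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_table(root)
        print(max_partition_value(root, "date"))
```

Values are compared as strings here, which happens to order ISO-formatted dates correctly. Note one caveat with this shortcut: a partition directory that contains no rows would still contribute its value, so a real optimization would need to account for empty partitions.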


> Spark SQL query related to only partition fields should not scan the whole data.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-12890
>                 URL: https://issues.apache.org/jira/browse/SPARK-12890
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Prakash Chockalingam
>
> I have a SQL query that references only partition fields. The query ends up scanning all the data, which is unnecessary.
> Example: {{select max(date) from table}}, where the table is partitioned by date.


