Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:38:04 UTC
[jira] [Resolved] (SPARK-13908) Limit not pushed down
[ https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-13908.
----------------------------------
Resolution: Incomplete
> Limit not pushed down
> ---------------------
>
> Key: SPARK-13908
> URL: https://issues.apache.org/jira/browse/SPARK-13908
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: Spark compiled from git with commit 53ba6d6
> Reporter: Luca Bruno
> Priority: Major
> Labels: bulk-closed, performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a Parquet file read and registered as a temporary table.
> The query takes as much time (minutes) as scanning all the records would, instead of just reading the first record.
> Using parquet-tools head is instead very fast (seconds), so I guess this is a missed optimization opportunity in Spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==
> CollectLimit 1
> +- WholeStageCodegen
> : +- Scan ParquetFormat part: struct<>, data: struct<........>[...] InputPaths: hdfs://...
> {noformat}
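> To illustrate why this matters, here is a minimal pure-Python sketch (not Spark code; the generator `scan_rows` is a hypothetical stand-in for a Parquet scan). When the limit is applied only after a full scan, every row is read; when the limit is pushed down into the scan, reading stops after the first row:
> {noformat}
> from itertools import islice
>
> def scan_rows(n_rows, counter):
>     """Simulate a table scan; counter tracks how many rows were read."""
>     for i in range(n_rows):
>         counter[0] += 1
>         yield {"id": i}
>
> # Limit applied AFTER a full scan: all rows are materialized first.
> reads = [0]
> rows = list(scan_rows(1_000_000, reads))[:1]
> print(reads[0])  # 1000000 rows read to answer LIMIT 1
>
> # Limit pushed down into the scan: reading stops after one row.
> reads = [0]
> rows = list(islice(scan_rows(1_000_000, reads), 1))
> print(reads[0])  # 1 row read
> {noformat}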
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org