You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Lantao Jin (Jira)" <ji...@apache.org> on 2019/12/04 01:47:00 UTC

[jira] [Updated] (SPARK-30114) Improve LIMIT only query by partial listing files

     [ https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lantao Jin updated SPARK-30114:
-------------------------------
    Summary: Improve LIMIT only query by partial listing files  (was: Optimize LIMIT only query by partial listing files)

> Improve LIMIT only query by partial listing files
> -------------------------------------------------
>
>                 Key: SPARK-30114
>                 URL: https://issues.apache.org/jira/browse/SPARK-30114
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution time would be very big since it has to list all files to build a RDD before execution. But almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files. This optimization will create a *SinglePartitionReadRDD* to address it.
> In our production result, this optimization benefits a lot. The duration time of simple query with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table which has about one hundred thousands files would run over 30 seconds, after applying this optimization, the time decreased to 5 seconds.
> Should support both Spark DataSource Table and Hive Table which can be converted to DataSource table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org