Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/04 02:04:26 UTC

[GitHub] [spark] LantaoJin opened a new pull request #26754: [SPARK-30115][SQL] Improve limit only query on datasource table

URL: https://github.com/apache/spark/pull/26754
 
 
   ### What changes were proposed in this pull request?
   
   We use Spark as an ad-hoc query engine. Most of our users' SELECT queries include a LIMIT operation, for example:
   1) SELECT * FROM TABLE_A LIMIT N
   2) SELECT colA FROM TABLE_A LIMIT N
   3) CREATE TABLE TAB_B AS SELECT * FROM TABLE_A LIMIT N
   If TABLE_A is a large table (an RDD with thousands of partitions), execution is slow because Spark has to list all files to build the RDD before execution starts. Most of the time, however, N is small (10, 100, or 1000), so we don't need to scan all files. This optimization creates a **SinglePartitionReadRDD** to address this.
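
To illustrate why a full listing is wasteful when N is small, here is a minimal Python sketch that does not use Spark. `list_files`, `partial_listing`, and the file paths are hypothetical stand-ins invented for this example; they are not part of this PR or of any Spark API. The sketch only shows that a lazy listing can stop after a bounded number of entries.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def list_files(num_files: int, counter: List[int]) -> Iterator[str]:
    """Hypothetical stand-in for a file-system listing; yields paths lazily."""
    for i in range(num_files):
        counter[0] += 1  # count how many entries were actually listed
        yield f"/warehouse/table_a/part-{i:05d}.parquet"


def full_listing(files: Iterable[str]) -> List[str]:
    """Materialize every path, as an eager listing would."""
    return list(files)


def partial_listing(files: Iterable[str], max_files: int) -> List[str]:
    """Take at most max_files paths and stop listing after that."""
    return list(islice(files, max_files))


if __name__ == "__main__":
    listed = [0]
    picked = partial_listing(list_files(100_000, listed), max_files=10)
    print(len(picked), "files picked;", listed[0], "entries listed")
```

With a table of one hundred thousand files, the eager variant touches every entry, while the partial variant touches only as many entries as it returns.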
   
   In our production environment, this optimization helps a lot. The duration of a simple query with LIMIT can drop by a factor of 5 to 10. For example, before this optimization, a query on a table with about one hundred thousand files ran for over 30 seconds; after applying it, the time decreased to 5 seconds.
   
   This PR only addresses Spark datasource tables.
   Follow-up PRs for Hive tables and views will be filed after this one is merged.
   
   ### How is it implemented?
   1. Add two configurations, `PARTIAL_LISTING_ENABLED` and `PARTIAL_LISTING_MAX_FILES`
   2. In `FindDataSourceTable.apply()`, when resolving a `GlobalLimit`, we add a `partialListing` flag to the `FileIndex`.
   3. In `DataSourceScanExec.inputRDD`, we check the `partialListing` flag in `relation.location` and, if it is set, create a `SinglePartitionReadRDD`. This RDD assigns fewer than `PARTIAL_LISTING_MAX_FILES` files to a single partition.
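
The three steps above can be sketched as a small plain-Python model. This is not the Scala code in the PR: the classes `Conf`, `FileIndex`, `Relation`, and `GlobalLimit`, the functions `resolve` and `input_partitions`, and the default values are all hypothetical and only mirror the names used in the description. The sketch also assumes the cap is "at most" the configured maximum, since the exact bound is not spelled out here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Conf:
    # Step 1: stand-ins for PARTIAL_LISTING_ENABLED and PARTIAL_LISTING_MAX_FILES.
    # The default values here are assumptions made for this example.
    partial_listing_enabled: bool = True
    partial_listing_max_files: int = 100


@dataclass
class FileIndex:
    files: List[str]
    partial_listing: bool = False


@dataclass
class Relation:
    location: FileIndex


@dataclass
class GlobalLimit:
    limit: int
    child: Relation


def resolve(plan: Union[GlobalLimit, Relation], conf: Conf) -> Union[GlobalLimit, Relation]:
    """Step 2: flag the file index when a limit sits directly on a relation."""
    if (conf.partial_listing_enabled
            and isinstance(plan, GlobalLimit)
            and isinstance(plan.child, Relation)):
        plan.child.location.partial_listing = True
    return plan


def input_partitions(relation: Relation, conf: Conf,
                     files_per_partition: int = 2) -> List[List[str]]:
    """Step 3: one capped partition if flagged, otherwise the usual split."""
    files = relation.location.files
    if relation.location.partial_listing:
        # Assumption: cap at max_files; the real bound may be strict.
        return [files[:conf.partial_listing_max_files]]
    return [files[i:i + files_per_partition]
            for i in range(0, len(files), files_per_partition)]
```

For example, with `partial_listing_max_files=3` and a ten-file relation under a `GlobalLimit`, `input_partitions` returns a single partition holding the first three files; the same relation without the flag is split into five partitions of two files each.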
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   Added `LimitOnlyQuerySuite`.

