Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2018/07/25 03:48:00 UTC

[jira] [Resolved] (SPARK-23957) Sorts in subqueries are redundant and can be removed

     [ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-23957.
-----------------------------
       Resolution: Fixed
         Assignee: Henry Robinson
    Fix Version/s: 2.4.0

> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
>                 Key: SPARK-23957
>                 URL: https://issues.apache.org/jira/browse/SPARK-23957
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Unless the sort is combined with a {{LIMIT}}, there is no correctness reason for a planned and optimized subquery to contain a sort operator, since the result of a subquery is an unordered collection of tuples.
> For example:
> {{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a subquery. SPARK-23375 is related, but it removes sorts that are functionally redundant because they repeat an ordering the input already has.
>   
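
To make the proposal concrete, here is a minimal sketch of such a rule over a toy single-child plan tree. It is not Catalyst code and not the change that resolved this issue: the Plan class, the operator names, and remove_subquery_sorts are all hypothetical, and the sketch assumes every aggregate is order-insensitive (as count is).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node with at most one child (hypothetical, not Catalyst)."""
    op: str
    child: Optional["Plan"] = None


def remove_subquery_sorts(plan, order_matters=True):
    """Drop Sort nodes whose output order no consumer can observe."""
    if plan is None:
        return None
    if plan.op == "Sort":
        # Whatever feeds a Sort is reordered anyway, so its own order is irrelevant.
        child = remove_subquery_sorts(plan.child, order_matters=False)
        return Plan("Sort", child) if order_matters else child
    if plan.op == "Limit":
        # LIMIT picks rows by position, so the ordering below it is significant.
        needs_order = True
    elif plan.op in ("Subquery", "Aggregate"):
        # A subquery yields a bag of tuples; assumed order-insensitive aggregates.
        needs_order = False
    else:
        # Project, Filter, Scan, ...: pass the parent's requirement through.
        needs_order = order_matters
    return Plan(plan.op, remove_subquery_sorts(plan.child, needs_order))


def ops(plan):
    """Flatten a plan into a top-down list of operator names."""
    out = []
    while plan is not None:
        out.append(plan.op)
        plan = plan.child
    return out


# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
count_query = Plan("Aggregate", Plan("Subquery", Plan("Sort", Plan("Project", Plan("Scan")))))

# SELECT * FROM (SELECT id FROM dft ORDER BY id)
select_star = Plan("Project", Plan("Subquery", Plan("Sort", Plan("Project", Plan("Scan")))))

# SELECT * FROM (SELECT id FROM dft ORDER BY id LIMIT 10)
limited = Plan("Project", Plan("Subquery", Plan("Limit", Plan("Sort", Plan("Scan")))))

print(ops(remove_subquery_sorts(count_query)))
print(ops(remove_subquery_sorts(select_star)))
print(ops(remove_subquery_sorts(limited)))
```

In this model the sort disappears from both queries in the description, while the sort beneath the LIMIT is kept because the limit depends on it. A real rule would also have to handle joins, multi-child operators, and order-sensitive aggregates, which this sketch ignores.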



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
