Posted to issues@spark.apache.org by "Henry Robinson (JIRA)" <ji...@apache.org> on 2018/04/10 22:52:00 UTC

[jira] [Created] (SPARK-23957) Sorts in subqueries are redundant and can be removed

Henry Robinson created SPARK-23957:
--------------------------------------

             Summary: Sorts in subqueries are redundant and can be removed
                 Key: SPARK-23957
                 URL: https://issues.apache.org/jira/browse/SPARK-23957
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Henry Robinson


Unless combined with a {{LIMIT}}, there is no correctness reason for a planned and optimized subquery to contain any sort operators, since the result of a subquery is an unordered collection of tuples.

For example:

{{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}

has the following plan:
{code:java}
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(2) Project
         +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
            +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
               +- *(1) Project [id#0L]
                  +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
{code}
... but the sort operator is redundant.

Less intuitively, the sort is also redundant in selections from an ordered subquery:

{{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}

has plan:
{code:java}
== Physical Plan ==
*(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
   +- *(1) Project [id#0L]
      +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
{code}
... but again, since the subquery returns a bag of tuples, the sort is unnecessary.

We should consider adding an optimizer rule that removes a sort inside a subquery unless it is combined with a {{LIMIT}}. SPARK-23375 is related, but it only removes sorts that are functionally redundant because they perform the same ordering.
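
To make the proposed rewrite concrete, here is a minimal sketch on a toy plan tree. It is not Catalyst code: the {{Node}} class, the operator names, and {{remove_subquery_sorts}} are all hypothetical and exist only for illustration. The sketch assumes single-child plans and is deliberately conservative: it leaves every sort beneath a limit untouched, even where a real rule might be able to do more.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical single-child plan node; not a Spark class."""
    op: str                          # e.g. "Scan", "Sort", "Limit", "Subquery"
    child: Optional["Node"] = None


def remove_subquery_sorts(node: Optional[Node],
                          in_subquery: bool = False,
                          under_limit: bool = False) -> Optional[Node]:
    """Drop Sort nodes that sit inside a subquery and not beneath a Limit."""
    if node is None:
        return None
    if node.op == "Sort" and in_subquery and not under_limit:
        # The subquery yields an unordered bag, so skip the sort entirely.
        return remove_subquery_sorts(node.child, in_subquery, under_limit)
    child = remove_subquery_sorts(
        node.child,
        in_subquery or node.op == "Subquery",
        under_limit or node.op == "Limit",   # conservative: keep sorts below a limit
    )
    return replace(node, child=child)


def ops(node: Optional[Node]) -> List[str]:
    """List operator names from the root down, for easy comparison."""
    names = []
    while node is not None:
        names.append(node.op)
        node = node.child
    return names


# Toy version of: SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
counted = Node("Aggregate", Node("Subquery", Node("Sort", Node("Scan"))))

# Toy version of a subquery whose sort feeds a limit; the sort must stay.
limited = Node("Project", Node("Subquery", Node("Limit", Node("Sort", Node("Scan")))))

# A sort in the outermost query is not inside a subquery and must stay.
outer_sorted = Node("Sort", Node("Subquery", Node("Scan")))

print(ops(remove_subquery_sorts(counted)))
print(ops(remove_subquery_sorts(limited)))
print(ops(remove_subquery_sorts(outer_sorted)))
```

In this toy model the first plan loses its sort, while the second and third are returned unchanged. A real rule would also have to handle multi-child operators such as joins and unions, which this sketch does not attempt.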