Posted to issues@spark.apache.org by "Henry Robinson (JIRA)" <ji...@apache.org> on 2018/04/10 22:52:00 UTC
[jira] [Created] (SPARK-23957) Sorts in subqueries are redundant and can be removed
Henry Robinson created SPARK-23957:
--------------------------------------
Summary: Sorts in subqueries are redundant and can be removed
Key: SPARK-23957
URL: https://issues.apache.org/jira/browse/SPARK-23957
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Henry Robinson
Unless combined with a {{LIMIT}}, a sort operator in a planned and optimized subquery serves no correctness purpose, since the result of the subquery is an unordered collection of tuples.
For example:
{{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}
has the following plan:
{code:java}
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(2) Project
         +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
            +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
               +- *(1) Project [id#0L]
                  +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
{code}
... but the sort operator is redundant.
Less intuitively, the sort is also redundant in selections from an ordered subquery:
{{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
has plan:
{code:java}
== Physical Plan ==
*(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
   +- *(1) Project [id#0L]
      +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
{code}
... but again, since the subquery returns a bag of tuples, the sort is unnecessary.
We should consider adding an optimizer rule that removes a sort inside a subquery. SPARK-23375 is related, but removes sorts that are functionally redundant because they perform the same ordering.
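To make the proposed rule concrete, here is a minimal sketch in Python over a toy plan tree. It is not Spark's Catalyst API: the {{Node}} class, the operator names, and the functions {{strip_sorts}}, {{optimize}} and {{ops}} are all hypothetical and exist only for illustration. The sketch assumes a conservative policy: any sort found under a {{LIMIT}} is left alone, and every other sort inside a subquery is dropped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node: an operator name plus child plans."""
    op: str  # e.g. "Scan", "Project", "Sort", "Limit", "Aggregate", "Subquery"
    children: Tuple["Node", ...] = ()


def strip_sorts(plan: Node) -> Node:
    """Remove Sort nodes from a subquery plan, stopping at any Limit."""
    if plan.op == "Limit":
        # Ordering below a Limit decides which rows survive, so keep it as is.
        return plan
    if plan.op == "Sort":
        # The subquery result is a bag of tuples, so the ordering is unobservable.
        return strip_sorts(plan.children[0])
    return Node(plan.op, tuple(strip_sorts(c) for c in plan.children))


def optimize(plan: Node) -> Node:
    """Apply strip_sorts beneath every Subquery node in the plan."""
    children = tuple(optimize(c) for c in plan.children)
    if plan.op == "Subquery":
        children = tuple(strip_sorts(c) for c in children)
    return Node(plan.op, children)


def ops(plan: Node) -> List[str]:
    """Operator names in pre-order, for inspecting a plan."""
    result = [plan.op]
    for child in plan.children:
        result.extend(ops(child))
    return result


# Toy shape of: SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
count_plan = Node("Aggregate", (
    Node("Subquery", (
        Node("Sort", (
            Node("Project", (Node("Scan"),)),
        )),
    )),
))
print(ops(optimize(count_plan)))

# A sort under a LIMIT is left in place.
limited_plan = Node("Project", (
    Node("Subquery", (
        Node("Limit", (
            Node("Sort", (Node("Scan"),)),
        )),
    )),
))
print(ops(optimize(limited_plan)))
```

Stopping at the first {{Limit}} is deliberately conservative: it may keep a sort that could in principle be removed, but it never drops one that affects which rows the subquery returns.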