You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Takeshi Yamamuro (JIRA)" <ji...@apache.org> on 2017/03/02 12:39:45 UTC
[jira] [Comment Edited] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

    [ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892169#comment-15892169 ] 

Takeshi Yamamuro edited comment on SPARK-19503 at 3/2/17 12:39 PM:
-------------------------------------------------------------------

I'm not sure this should be fixed though, postgresql leaves this kind of sorting as it is...;
{code}
postgres=# \d testTable
   Table "public.testTable"
 Column |  Type   | Modifiers 
--------+---------+-----------
 key    | integer | 
 value  | integer |

postgres=# select count(*) from (select * from testTable order by key) t;
 count 
-------
     1
(1 row)

postgres=# explain select count(*) from (select * from testTable order by key) t;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Aggregate  (cost=192.41..192.42 rows=1 width=0)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=4)
         Sort Key: testTable.key
         ->  Seq Scan on testTable  (cost=0.00..32.60 rows=2260 width=4)
(4 rows)
{code}


was (Author: maropu):
I'm not sure this should be fixed though, postgresql leaves this kind of sorting as it is;
{code}
postgres=# \d testTable
   Table "public.testTable"
 Column |  Type   | Modifiers 
--------+---------+-----------
 key    | integer | 
 value  | integer |

postgres=# select count(*) from (select * from testTable order by key) t;
 count 
-------
     1
(1 row)

postgres=# explain select count(*) from (select * from testTable order by key) t;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Aggregate  (cost=192.41..192.42 rows=1 width=0)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=4)
         Sort Key: testTable.key
         ->  Seq Scan on testTable  (cost=0.00..32.60 rows=2260 width=4)
(4 rows)
{code}

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19503
>                 URL: https://issues.apache.org/jira/browse/SPARK-19503
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0
>         Environment: Perhaps only a pyspark or databricks AWS issue
>            Reporter: R
>            Priority: Minor
>              Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs shuffle and sort and then count! This is wasteful as sort is not required here and makes me wonder how smart the algebraic optimiser is indeed! The data may be partitioned by known count (such as parquet files) and we should not shuffle to just perform count.
> This may look trivial, but if optimiser fails to recognise this, I wonder what else is it missing especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org