Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/15 16:11:41 UTC

[GitHub] [spark] aokolnychyi edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

aokolnychyi edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658839477


   Oops, thanks for catching the corner case quickly, @dongjoon-hyun!
   
   My original idea for this PR was based on the fact that range partitioning followed by a local sort is equivalent to a global sort, provided the partitioning and sort expressions are compatible. I then tried to generalize the idea and did not see an obvious corner case. This one is very subtle, but it makes sense on reflection: repartition nodes change the data distribution but do not necessarily change the ordering of the data (at the very least, sorted chunks may survive). This is part of the reason we excluded coalesce in the original proposal. Based on the example above, it seems to hold even when we hash partition the data.
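
To make the two claims above concrete, here is a small pure-Python sketch (not Spark code). It assumes integer rows, hypothetical helper names (`range_partition`, `hash_partition`, `flatten`), and a single upstream partition; `r % n` is only a stand-in for a hash function. It shows that range partitioning plus a local sort yields a globally sorted result, and that hash partitioning already-sorted input leaves each output partition sorted.

```python
from bisect import bisect_right


def range_partition(rows, bounds):
    """Route each row to a partition by range; `bounds` are sorted split points."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for r in rows:
        parts[bisect_right(bounds, r)].append(r)
    return parts


def hash_partition(rows, n):
    """Route each row to a partition by a toy hash (r % n), keeping arrival order."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[r % n].append(r)
    return parts


def flatten(parts):
    """Read partitions back in partition order."""
    return [r for p in parts for r in p]


rows = [7, 2, 9, 4, 1, 8, 3, 6, 5, 0]

# Claim 1: range partitioning followed by a local sort gives a global sort.
range_then_local_sort = flatten([sorted(p) for p in range_partition(rows, [3, 6])])

# Claim 2: hash partitioning sorted input does not destroy the ordering
# inside each output partition.
hashed = hash_partition(sorted(rows, reverse=True), 2)
```

In a real shuffle, rows from several upstream partitions are interleaved, so only sorted runs would survive rather than a fully sorted partition. That is still enough for a sort below a repartition to be observable.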
   
   I'd explore cases where a sort sits directly on top of a repartition node. In that case, we are sure that both the ordering and the distribution change, so we can potentially ignore the sort below the repartition.
   
   For example, we may have this:
   
   ```
   sql("select * from (select * from (select * from t order by b desc) distribute by a) order by b asc")
   ```
   
   ```
   Sort [b#6 ASC NULLS FIRST], true
   +- RepartitionByExpression [a#5], 4
      +- Sort [b#6 DESC NULLS LAST], true
         +- Repartition 2, true
            +- LocalRelation [a#5, b#6]
   ```
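
The rewrite being considered for the plan above can be sketched on a toy plan tree. This is a hypothetical illustration in plain Python, not the Catalyst rule from this PR: the `Node` class, the `REPARTITIONS` set, and `drop_sort_under_sorted_repartition` are made-up names, and the sketch assumes the outer sort is a global sort. Whether dropping the inner sort is always safe is exactly the open question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A toy single-child logical plan node (hypothetical, not a Spark class)."""
    kind: str
    args: Tuple[str, ...] = ()
    child: Optional["Node"] = None


REPARTITIONS = {"Repartition", "RepartitionByExpression"}


def drop_sort_under_sorted_repartition(plan):
    """Remove a Sort below a repartition node when a Sort sits directly above it."""
    if plan is None:
        return None
    # Rewrite bottom-up so nested occurrences are handled too.
    child = drop_sort_under_sorted_repartition(plan.child)
    plan = Node(plan.kind, plan.args, child)
    if (
        plan.kind == "Sort"
        and child is not None
        and child.kind in REPARTITIONS
        and child.child is not None
        and child.child.kind == "Sort"
    ):
        # Skip the inner Sort: reattach its child to the repartition node.
        repartition = Node(child.kind, child.args, child.child.child)
        return Node(plan.kind, plan.args, repartition)
    return plan


def kinds(plan):
    """List node kinds from the root down to the leaf."""
    out = []
    while plan is not None:
        out.append(plan.kind)
        plan = plan.child
    return out


# The plan from the example above, written top-down.
plan = Node("Sort", ("b ASC",),
            Node("RepartitionByExpression", ("a", "4"),
                 Node("Sort", ("b DESC",),
                      Node("Repartition", ("2",),
                           Node("LocalRelation", ("a", "b"))))))

optimized = drop_sort_under_sorted_repartition(plan)
```

Here `kinds(optimized)` no longer contains the inner `Sort`, while the outer `Sort` and both repartition nodes are kept.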
   
   Is there a case where we would want to keep the first sort, the one that runs before RepartitionByExpression?

