Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/15 08:02:04 UTC

[GitHub] [spark] hvanhovell commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

hvanhovell commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658612372


   Ehh... AFAIK nested ordering can be ignored from a relational algebra point of view, so I am not sure this is a very solid argument. This feels a bit like an example of [Hyrum's law](https://www.hyrumslaw.com/). If you want sorted runs in ORC, then you ought to fix it there and not rely on some implicit system behavior.
   
   Regarding the shuffles: if the data is sorted before it goes into the shuffle, then the individual shuffle blocks are sorted as well. This is also the reason why doing a sort aggregate is not completely terrible (TimSort is good at identifying sorted runs).
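
To make the point about sorted runs concrete, here is a rough sketch in plain Python. It is not Spark's shuffle code: `simulate_shuffle_read`, `count_ascending_runs`, and the sample blocks are hypothetical names and data made up for illustration, and the run counter is a simplification (real TimSort also detects strictly descending runs and enforces a minimum run length). It only shows that concatenating blocks that are each sorted leaves far fewer runs for a run-aware sort to merge.

```python
from typing import List, Sequence


def count_ascending_runs(values: Sequence[int]) -> int:
    """Count maximal non-decreasing runs in `values`.

    Simplified stand-in for TimSort's run detection: a new run
    starts whenever an element is smaller than its predecessor.
    """
    if not values:
        return 0
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            runs += 1
    return runs


def simulate_shuffle_read(blocks: List[List[int]]) -> List[int]:
    """Concatenate blocks in arrival order (hypothetical model of a
    reducer reading its shuffle blocks one after another)."""
    out: List[int] = []
    for block in blocks:
        out.extend(block)
    return out


# Hypothetical data: the same values, split into two blocks.
# In the first case each block was sorted before the "shuffle".
sorted_blocks = [[1, 3, 5, 7, 9], [0, 2, 4, 6, 8]]
unsorted_blocks = [[5, 1, 9, 3, 7], [4, 0, 8, 2, 6]]

sorted_input = simulate_shuffle_read(sorted_blocks)
unsorted_input = simulate_shuffle_read(unsorted_blocks)

# With pre-sorted blocks, the number of runs is bounded by the
# number of blocks; without pre-sorting it is not.
runs_sorted = count_ascending_runs(sorted_input)
runs_unsorted = count_ascending_runs(unsorted_input)
```

Under these assumptions the pre-sorted input has at most one run per block, so a run-aware sort mostly has to merge, whereas the unsorted input fragments into many short runs.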
   

