You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/07 17:36:00 UTC

[GitHub] [spark] fitermay edited a comment on issue #23986: [SPARK-27070] Fix performance bug in DefaultPartitionCoalescer

fitermay edited a comment on issue #23986: [SPARK-27070] Fix performance bug in DefaultPartitionCoalescer
URL: https://github.com/apache/spark/pull/23986#issuecomment-470621102
 
 
   @srowen @attilapiros 
   After digging into this further I found out what exactly is happening and why sorting here causes a major issue.
   
   It runs out EMRFS returns the string '*' as the host of each block.  This ends up invoking the worst case of this algorithm where it tries to jam everything into the same preferred partition. In turn, this ends up running sort on hundreds thousands of records each iteration to find the minimum.  I've contacted the EMR team to suggest changing the host to 'localhost' but apparently that would break MR performance on Yarn. 
   
   I still think this patch is a win because:
   
   1) It's actually simpler and less code than the pre-patch code
   2) There are lots of EMR users who would benefit from this until a strategic solution is found
   3) It improves performance in less extreme cases as well
   
   I will try to make the suggested changes tonight and also generate some performance numbers for the extreme case tonight.
   
   Thanks!
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org