Posted to user@spark.apache.org by bchazalet <bc...@companywatch.net> on 2014/12/02 18:19:59 UTC

sort algorithm using sortBy

I am trying to understand the sort algorithm used by RDD#sortBy. I have read
this post from Matei
<http://apache-spark-user-list.1001560.n3.nabble.com/Complexity-Efficiency-of-SortByKey-tp14328p14332.html>,
which already helps a little.

I'd like to understand the distributed merge-sort in more detail because, in
my case, the sort takes 10 times longer on a field whose values are poorly
distributed (the field's value is 0 for many of the items) than on a field
whose values are better distributed.
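
To make the suspected cause concrete, here is a minimal sketch in plain
Python (it does not use Spark). It assumes the sort first range-partitions
the data using split points taken from a sample of the keys and then sorts
each partition locally, which is how I understand sortByKey to work. The
helpers range_boundaries and partition_sizes, the 8 partitions, and the 80%
share of zeros are all made up for illustration.

```python
import bisect
import random
from collections import Counter


def range_boundaries(keys, num_partitions, sample_size=1000, seed=0):
    """Pick num_partitions - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_sizes(keys, boundaries):
    """Count how many keys land in each range partition."""
    # Equal keys always map to the same partition index.
    counts = Counter(bisect.bisect_left(boundaries, k) for k in keys)
    return [counts.get(p, 0) for p in range(len(boundaries) + 1)]


rng = random.Random(42)
n = 100_000

# Well-distributed field: values spread evenly over a large range.
uniform = [rng.randrange(1_000_000) for _ in range(n)]

# Skewed field (assumed shape): about 80% of the items have the value 0.
skewed = [0 if rng.random() < 0.8 else rng.randrange(1, 1_000_000)
          for _ in range(n)]

uniform_sizes = partition_sizes(uniform, range_boundaries(uniform, 8))
skewed_sizes = partition_sizes(skewed, range_boundaries(skewed, 8))

print("uniform:", uniform_sizes)
print("skewed: ", skewed_sizes)
```

Because all equal keys fall into the same range, the partition that holds
the zeros ends up with most of the data in the skewed case, so a single task
would do most of the sorting. If that is what happens in Spark, it would
explain the slowdown I am seeing.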

In particular, I am wondering whether the sort algorithm can be modified, or
replaced with one that better fits the first distribution (given that the
distribution would be known in advance).
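
One option I can think of that leaves the algorithm itself untouched is to
sort on a composite key (field, tie-breaker), so that items sharing the same
field value can be spread over several ranges. The sketch below is plain
Python, not Spark, and again assumes sampled range partitioning. The helper
sizes_after_range_partitioning and the use of the item index as a unique
tie-breaker are hypothetical.

```python
import bisect
import random
from collections import Counter


def sizes_after_range_partitioning(keys, num_partitions, sample_size=1000, seed=0):
    """Sample the keys, derive split points, and count keys per partition."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    boundaries = [sample[int(step * i)] for i in range(1, num_partitions)]
    counts = Counter(bisect.bisect_left(boundaries, k) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


rng = random.Random(7)
n = 100_000

# Skewed field (assumed shape): about 80% of the items have the value 0.
values = [0 if rng.random() < 0.8 else rng.randrange(1, 1_000_000)
          for _ in range(n)]

# Plain key: the field value alone.
plain_keys = values

# Composite key: (field value, unique id). The index i stands in for any
# unique or well-distributed attribute of the item.
composite_keys = [(v, i) for i, v in enumerate(values)]

plain_sizes = sizes_after_range_partitioning(plain_keys, 8)
composite_sizes = sizes_after_range_partitioning(composite_keys, 8)

print("plain key:    ", plain_sizes)
print("composite key:", composite_sizes)
```

Tuples compare element by element, so the result is still ordered by the
field value; only the order among items with equal values is now fixed by
the tie-breaker. Whether this is acceptable depends on whether that order
matters.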

I'll be happy to look at the code myself, if someone could provide me with a
pointer to the file(s) I should have a look at.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sort-algorithm-using-sortBy-tp20179.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.