Posted to user@spark.apache.org by cjwang <cj...@cjwang.us> on 2014/09/16 07:31:47 UTC

Complexity/Efficiency of SortByKey

I wonder what algorithm is used to implement sortByKey? I assume it is some
O(n log n) sort parallelized across x nodes, right?

Then, what size of data would make it worthwhile to use sortByKey on
multiple processors rather than standard Scala sort functions on a
single processor (considering the overhead of putting the data into RDDs and
collecting it back)?





Re: Complexity/Efficiency of SortByKey

Posted by Matei Zaharia <ma...@gmail.com>.
sortByKey is indeed O(n log n). It makes a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge sort: first range-partition the data on each machine, then run a reduce phase that merges the data for each partition. The point where it becomes useful to scale out versus a single machine is probably pretty high, because communication over a network is *much* slower than memory bandwidth within a machine. Generally it would make the most sense for data that doesn't fit in memory on a single machine, or data that already starts out distributed.
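
Roughly, here's the idea as a simplified single-process sketch in plain Scala (illustrative only -- this is not Spark's actual RangePartitioner code, and the sample of 10 keys per partition is an arbitrary choice):

import scala.util.Random

// Simplified sketch of the two passes behind sortByKey -- not Spark's
// actual implementation. "partitions" stands in for the blocks of an
// RDD; keys are Ints for brevity.
object SortByKeySketch {
  def sortByKeySketch(partitions: Seq[Seq[(Int, String)]],
                      numOut: Int): Seq[Seq[(Int, String)]] = {
    // Pass 1: sample keys from each partition and pick range boundaries
    // that should yield roughly even-sized output partitions.
    val sample = partitions.flatMap(p => Random.shuffle(p).take(10)).map(_._1).sorted
    require(sample.nonEmpty, "sketch assumes non-empty input")
    val step = math.max(1, sample.length / numOut)
    val bounds = (1 until numOut).map(i => sample(math.min(i * step, sample.length - 1)))

    // Pass 2: route each record to the range it falls into (the shuffle),
    // then sort every range locally. Concatenating the sorted ranges in
    // order yields a total order over all the data.
    val buckets = Array.fill(numOut)(Vector.newBuilder[(Int, String)])
    for (p <- partitions; kv <- p) {
      val idx = bounds.indexWhere(kv._1 < _) match {
        case -1 => numOut - 1 // key >= every boundary: last range
        case i  => i
      }
      buckets(idx) += kv
    }
    buckets.toSeq.map(_.result().sortBy(_._1))
  }
}

In the real thing each bucket lives on a different machine, so the per-range sorts run in parallel and only the boundary computation needs a global view.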

Please also note that if you run Spark on just one multicore machine, it still goes through many of the same code paths as on a cluster (e.g. serializing data between tasks) -- it's not optimized to be as fast as, say, a multithreaded sort framework. So it wouldn't make a ton of sense to use it for that.
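
If you want to measure it yourself, a quick comparison on one machine might look like this (hypothetical data size; expect the plain Scala sort to win until the data no longer fits in memory):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicits for pair-RDD methods like sortByKey

// Hypothetical micro-benchmark: single-threaded Scala sort vs. sortByKey
// on a local[*] SparkContext. The data size is made up.
object LocalSortComparison {
  def main(args: Array[String]): Unit = {
    val data = Seq.fill(1000000)((scala.util.Random.nextInt(), "v"))

    // Plain Scala: no RDDs, no serialization, no shuffle.
    val local = data.sortBy(_._1)

    // Spark on one multicore machine still goes through the sampling
    // pass, the partitioning, and serialization between tasks.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("sort-compare"))
    val distributed = sc.parallelize(data, 8).sortByKey().collect()
    sc.stop()

    // Sanity check: both produce the same key order.
    assert(local.map(_._1) == distributed.map(_._1).toSeq)
  }
}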

Matei
