You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Larry Xiao <xi...@sjtu.edu.cn> on 2014/09/25 18:41:12 UTC

VertexRDD partition imbalance

Hi all

VertexRDD is partitioned with HashPartitioner, and it exhibits some 
imbalance of tasks.
For example, Connected Components with partition strategy Edge2D:


        Aggregated Metrics by Executor

Executor ID 	Task Time 	Total Tasks 	Failed Tasks 	Succeeded Tasks 
Input 	Shuffle Read 	Shuffle Write 	Shuffle Spill (Memory) 	Shuffle 
Spill (Disk)
1 	10 s 	10 	0 	10 	234.6 MB 	0.0 B 	43.2 MB 	0.0 B 	0.0 B
2 	3 s 	3 	0 	3 	70.4 MB 	0.0 B 	13.0 MB 	0.0 B 	0.0 B
3 	6 s 	6 	0 	6 	140.7 MB 	0.0 B 	25.9 MB 	0.0 B 	0.0 B
4 	9 s 	8 	0 	8 	187.9 MB 	0.0 B 	34.6 MB 	0.0 B 	0.0 B
5 	10 s 	9 	0 	9 	211.4 MB 	0.0 B 	38.9 MB 	0.0 B 	0.0 B

For a stage on mapPartitions at VertexRDD.scala:347
343
344   /** Generates an RDD of vertex attributes suitable for shipping to 
the edge partitions. */
345   private[graphx] def shipVertexAttributes(
346       shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, 
VertexAttributeBlock[VD])] = {
347 
partitionsRDD.mapPartitions(_.flatMap(_.shipVertexAttributes(shipSrc, 
shipDst)))
348   }
349

This is executed for every iteration in Pregel, so the imbalance is bad 
for performance.

However, when run PageRank with Edge2D, the tasks are even across 
executors. (all finish 6 tasks)
Our configuration is 6 node, 36 partitions.

My questions is:

    What decides the number of tasks for different executors? And how to
    make it balance?

Thanks!
Larry