Posted to dev@spark.apache.org by Daniel Margo <dm...@eecs.harvard.edu> on 2015/11/06 12:35:01 UTC

GraphX EdgePartition format

I was looking through the GraphX source and noticed that the topology of an
EdgePartition is a triplet of source, destination, and data columns --
essentially a COO sparse matrix -- sorted by source, and equipped with an
index from each (global) vertex ID to the start of its (local) source
cluster. This index provides efficient local neighborhood lookup.
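The layout described above can be illustrated with a minimal Python sketch. This is a simplified model, not GraphX's actual code: the arrays `src`, `dst`, and `data` and the helpers `build_index` and `out_edges` are hypothetical names chosen for illustration, and a plain dict stands in for whatever hash map the real index uses.

```python
# Hypothetical COO-style edge partition: three parallel columns,
# sorted by source vertex ID (all values are made up for illustration).
src = [3, 3, 3, 7, 7, 9]
dst = [7, 9, 12, 3, 12, 3]
data = [0.5, 1.0, 2.0, 0.1, 0.7, 0.3]


def build_index(src):
    """Map each source vertex ID to the position where its cluster starts."""
    index = {}
    for pos, vid in enumerate(src):
        # setdefault keeps only the first position seen for each vertex ID.
        index.setdefault(vid, pos)
    return index


def out_edges(vid, src, dst, data, index):
    """Return (destination, data) pairs for every edge whose source is vid."""
    pos = index.get(vid)
    if pos is None:
        return []
    result = []
    # Scan forward until the source column changes: that ends the cluster.
    while pos < len(src) and src[pos] == vid:
        result.append((dst[pos], data[pos]))
        pos += 1
    return result


index = build_index(src)
neighbors_of_7 = out_edges(7, src, dst, data, index)
```

Note that in this model the scan finds the end of a cluster only by comparing successive entries of the source column, which is why that column has to be stored in full.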

Given that the columns are source-sorted, is there a reason that the
duplicate values in the source column are not packed away, as in, e.g.,
a CSR sparse matrix? That is, replace every source cluster with a single
source value plus a length. Furthermore, these source values would duplicate
the existing global2local index, so they can be removed entirely.
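A minimal Python sketch of the proposed packing, under the same simplifying assumptions: the sample data and the names `pack_sources`, `to_offsets`, `row_of`, and `out_edges` are hypothetical, and `row_of` is only a stand-in for the role the existing index would play. It does not reflect how GraphX or GraphLab actually store edges.

```python
# Same hypothetical source-sorted COO columns as a starting point.
src = [3, 3, 3, 7, 7, 9]
dst = [7, 9, 12, 3, 12, 3]
data = [0.5, 1.0, 2.0, 0.1, 0.7, 0.3]


def pack_sources(src):
    """Collapse each run of equal source IDs into one value plus a length."""
    sources, lengths = [], []
    for vid in src:
        if sources and sources[-1] == vid:
            lengths[-1] += 1
        else:
            sources.append(vid)
            lengths.append(1)
    return sources, lengths


def to_offsets(lengths):
    """Turn run lengths into CSR-style row offsets (prefix sums)."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


sources, lengths = pack_sources(src)
offsets = to_offsets(lengths)

# If an index already maps each vertex ID to its row, the packed
# `sources` array is redundant for lookups and need not be stored.
row_of = {vid: row for row, vid in enumerate(sources)}


def out_edges(vid):
    """Return (destination, data) pairs for vid using only the offsets."""
    row = row_of.get(vid)
    if row is None:
        return []
    start, end = offsets[row], offsets[row + 1]
    return list(zip(dst[start:end], data[start:end]))


packed_neighbors_of_7 = out_edges(7)
```

In this toy example the six-entry source column shrinks to four offsets, and the saving grows with the average out-degree of the vertices in the partition.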

This is a common optimization in sparse matrix systems and I recall (perhaps
incorrectly) that GraphLab used this format -- is there a reason that GraphX
does not?
-dwm