You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ankur Dave (JIRA)" <ji...@apache.org> on 2014/09/07 03:34:28 UTC

[jira] [Created] (SPARK-3427) Standalone version of static PageRank

Ankur Dave created SPARK-3427:
---------------------------------

             Summary: Standalone version of static PageRank
                 Key: SPARK-3427
                 URL: https://issues.apache.org/jira/browse/SPARK-3427
             Project: Spark
          Issue Type: Improvement
          Components: GraphX
            Reporter: Ankur Dave
            Assignee: Ankur Dave


GraphX's current implementation of static (fixed iteration count) PageRank is implemented using the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the GraphX API instead of the higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org