Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hpe.com> on 2015/10/01 00:55:47 UTC

GraphX PageRank keeps 3 copies of graph in memory

Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in Spark, in particular the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

On each iteration a new graph is created and cached, and the old graph is unpersisted:
1) Create the new graph and cache it:
      rankGraph = rankGraph.joinVertices(rankUpdates) {
        (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
      }.cache()
2) Unpersist the old one:
      prevRankGraph.vertices.unpersist(false)
      prevRankGraph.edges.unpersist(false)

According to the code, at the end of each iteration only one graph should be in memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, between the two code fragments above, there will be two graphs, old and new, that is, two pairs of EdgeRDD and VertexRDD. However, when I run the example from the Spark examples folder, I observe different behavior.
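
The expectation described above can be checked on a plain RDD, without GraphX. The following is a minimal sketch, not the PageRank code itself: the object name CacheSwapSketch, the helper persistentCounts, the local[2] master and the toy update rule are assumptions made for illustration. It repeats the same cache-new, materialize, unpersist-old pattern and records how many RDDs the driver tracks as persistent before and after each unpersist call.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CacheSwapSketch {
  // For each iteration, returns (persistent RDDs before unpersist, persistent RDDs after).
  // Hypothetical helper for illustration; it is not part of Spark or GraphX.
  def persistentCounts(sc: SparkContext, iterations: Int): Seq[(Int, Int)] = {
    var current: RDD[Double] = sc.parallelize(Seq.fill(1000)(1.0)).cache()
    current.count() // materialize the initial RDD

    val counts = for (_ <- 1 to iterations) yield {
      val previous = current
      // Toy update standing in for the rank update; the constants are arbitrary.
      current = previous.map(rank => 0.15 + 0.85 * rank).cache()
      current.count() // materialize the new RDD before dropping the old one
      val before = sc.getPersistentRDDs.size
      previous.unpersist(blocking = false)
      val after = sc.getPersistentRDDs.size
      (before, after)
    }

    current.unpersist(blocking = false)
    counts
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an illustrative choice, not the cluster setup from this thread.
    val conf = new SparkConf().setAppName("CacheSwapSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      persistentCounts(sc, iterations = 3).zipWithIndex.foreach { case ((before, after), i) =>
        println(s"iteration ${i + 1}: persistent before unpersist = $before, after = $after")
      }
    } finally {
      sc.stop()
    }
  }
}
```

If the reasoning in this message holds, the first number should exceed the second by one on every iteration. Any extra copies seen with GraphX would then have to come from RDDs that GraphX persists internally, which this sketch does not model.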

Run the example (I checked that it runs the mentioned code):
$SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.graphx.SynthBenchmark"  --master spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to the "Storage" tab and the RDD DAG in the Spark UI, 3 VertexRDDs and 3 EdgeRDDs are cached even after all iterations have finished, although the code above suggests that at most 2 should be cached (and only during a particular stage of the iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing
Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing
Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing
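
The same information can also be read programmatically instead of from screenshots. The sketch below is an assumption-laden illustration, not the SynthBenchmark code: the object name ListCachedRdds, the helper persistentSummary, the local[2] master and the graph size are all made up for the example. It runs static PageRank on a small synthetic graph and prints every RDD that SparkContext.getPersistentRDDs still reports afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.util.GraphGenerators

object ListCachedRdds {
  // One line per RDD that the driver still tracks as persistent, ordered by RDD id.
  // Hypothetical helper for illustration; it is not part of Spark or GraphX.
  def persistentSummary(sc: SparkContext): Seq[String] =
    sc.getPersistentRDDs.toSeq.sortBy(_._1).map { case (id, rdd) =>
      s"id=$id name=${rdd.name} level=${rdd.getStorageLevel.description}"
    }

  def main(args: Array[String]): Unit = {
    // local[2] and 1000 vertices are illustrative, not the benchmark's settings.
    val conf = new SparkConf().setAppName("ListCachedRdds").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val graph = GraphGenerators.logNormalGraph(sc, numVertices = 1000).cache()
      val ranks = graph.staticPageRank(numIter = 10)
      ranks.vertices.count() // force evaluation of the final rank graph

      persistentSummary(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that getPersistentRDDs lists RDDs that are marked as persistent, whether or not their partitions have been materialized, so the list may not match the Storage tab line for line.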

Could you explain why 3 VertexRDDs and 3 EdgeRDDs are cached?

Is it OK that the code caches twice, given that joinVertices implicitly caches the vertices and the PageRank code then caches the whole graph?

Best regards, Alexander

RE: GraphX PageRank keeps 3 copies of graph in memory

Posted by "Ulanov, Alexander" <al...@hpe.com>.

Hi Robin,

Sounds interesting. I am running 1.5.0. Could you copy-paste your Storage tab?

I’ve just double checked on another cluster with 1 master and 5 workers. It still has 3 pairs of VertexRDD and EdgeRDD at the end of benchmark’s execution:

RDD Name              Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in ExternalBlockStore  Size on Disk
VertexRDD             Memory Deserialized 1x Replicated  3                  150%             6.9 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             155.5 MB        0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             154.7 MB        0.0 B                       0.0 B
VertexRDD, VertexRDD  Memory Deserialized 1x Replicated  3                  150%             8.4 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             202.9 MB        0.0 B                       0.0 B
VertexRDD             Memory Deserialized 1x Replicated  2                  100%             5.6 MB          0.0 B                       0.0 B

During the execution I observe that one pair is added and removed from the list. This should correspond to the unpersist statements in the code.

Also, according to the code, one should end up with 1 set of RDDs because of the unpersist statements at the end of the loop. Does that make sense to you?

Best regards, Alexander

From: Robin East [mailto:robin.east@xense.co.uk]
Sent: Friday, October 02, 2015 12:27 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: GraphX PageRank keeps 3 copies of graph in memory

Alexander,

I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage tab. This is on 1.5.0. What version are you using?

Robin
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

RE: GraphX PageRank keeps 3 copies of graph in memory

Posted by "Ulanov, Alexander" <al...@hpe.com>.

Hi Ankur,

Could you help explain the problem below?

Best regards, Alexander

From: Ulanov, Alexander
Sent: Friday, October 02, 2015 11:39 AM
To: 'Robin East'
Cc: dev@spark.apache.org
Subject: RE: GraphX PageRank keeps 3 copies of graph in memory

Hi Robin,

Sounds interesting. I am running 1.5.0. Could you copy-paste your Storage tab?

I’ve just double checked on another cluster with 1 master and 5 workers. It still has 3 pairs of VertexRDD and EdgeRDD at the end of benchmark’s execution:

RDD Name              Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in ExternalBlockStore  Size on Disk
VertexRDD             Memory Deserialized 1x Replicated  3                  150%             6.9 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             155.5 MB        0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             154.7 MB        0.0 B                       0.0 B
VertexRDD, VertexRDD  Memory Deserialized 1x Replicated  3                  150%             8.4 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             202.9 MB        0.0 B                       0.0 B
VertexRDD             Memory Deserialized 1x Replicated  2                  100%             5.6 MB          0.0 B                       0.0 B

During the execution I observe that one pair is added and removed from the list. This should correspond to the unpersist statements in the code.

Also, according to the code, one should end up with 1 set of RDDs because of the unpersist statements at the end of the loop. Does that make sense to you?

Best regards, Alexander

From: Robin East [mailto:robin.east@xense.co.uk]
Sent: Friday, October 02, 2015 12:27 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: GraphX PageRank keeps 3 copies of graph in memory

Alexander,

I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage tab. This is on 1.5.0. What version are you using?

Robin
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
