Posted to user@spark.apache.org by sachintyagi22 <sa...@gmail.com> on 2015/08/25 13:15:18 UTC

Checkpointing in Iterative Graph Computation

Hi, 

I have stumbled upon an issue with iterative GraphX computation (using
v1.4.1). It goes like this --

Setup
1. Construct a graph.
2. Validate that the graph satisfies certain conditions. Here I do some
assert(*conditions*) within graph.triplets.foreach(). [Notice that this
materializes the graph.]

For n iterations
3. Update graph edges and vertices.
4. Collect deltas over the whole graph (to be used in the next iteration).
Again, this is done through graph.aggregate() and it materializes the graph.
5. Update the graph and use it in the next iteration (step 3). (A rough sketch
of this loop follows below.)
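
To make the structure concrete, here is a minimal sketch of that loop (the
graph construction, validation and update logic below are just placeholders
standing in for my real code):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    def runLoop(sc: SparkContext, n: Int): Unit = {
      // Step 1: toy graph construction (placeholder for the real one)
      val edgeList: RDD[Edge[Int]] =
        sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
      var graph: Graph[Int, Int] = Graph.fromEdges(edgeList, defaultValue = 0)

      // Step 2: validation -- this materializes the graph
      graph.triplets.foreach(t => assert(t.attr >= 0))

      for (i <- 1 to n) {
        // Step 3: update vertices/edges (placeholder update)
        graph = graph.mapVertices((_, attr) => attr + 1)

        // Step 4: collect a delta over the whole graph -- materializes it again
        val delta = graph.vertices.map(_._2).sum()

        // Step 5: the delta feeds into the next iteration's update (omitted here)
      }
    }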

Now the problem is -- after about 300 iterations I run into a
StackOverflowError due to the lengthy lineage. So I decided to checkpoint the
graph every k iterations. But it doesn't work.

The problem is -- once a graph is materialized, calling checkpoint() on it has
no effect, even after materializing the graph again. In fact, the
isCheckpointed method on such an RDD will always return false, even after
calling checkpoint() and count() on the RDD. The following code should
clarify --

    val users = sc.parallelize(Array((3L, ("rxin", "student")),
      (7L, ("jgonzal", "postdoc"))))
    // Materialize the RDD
    users.count()
    // Now call checkpoint and materialize again
    users.checkpoint()
    users.count()

    // This fails -- isCheckpointed is still false
    assert(users.isCheckpointed)

And it behaves the same way with Graph.checkpoint(). Now my problem is that in
both the setup and the iteration steps (steps 2 and 5 above) I have to
materialize the graph, which leaves me in a situation where I cannot
checkpoint it in the usual fashion.

Currently, I am working around this by creating a new Graph every kth
iteration with the same edges and vertices, checkpointing that new graph, and
then using it for iterations k+1 through 2k, and so on. This works.
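
In case it is useful, the workaround looks roughly like this (just a sketch;
checkpointCopy is my own helper name and k is whatever interval you pick):

    import scala.reflect.ClassTag
    import org.apache.spark.graphx._

    // Rebuild the graph from its current vertices and edges; the fresh copy
    // has not been materialized yet, so checkpoint() still takes effect and
    // the lineage is truncated.
    def checkpointCopy[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[VD, ED] = {
      val fresh = Graph(g.vertices, g.edges)
      fresh.checkpoint()        // must come before materialization
      fresh.vertices.count()    // force the checkpoint to be written
      fresh.edges.count()
      fresh
    }

    // inside the iteration loop:
    // if (i % k == 0) graph = checkpointCopy(graph)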

Now my questions are --
1. Why doesn't checkpointing work on an RDD once it has been materialized?
2. My use case looks pretty common -- how do people generally handle this?

Thanks in advance.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-in-Iterative-Graph-Computation-tp24443.html


Re: Checkpointing in Iterative Graph Computation

Posted by Robineast <Ro...@xense.co.uk>.
One other thought -- you need to call SparkContext.setCheckpointDir, otherwise
nothing will happen.
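
For example (the path below is just an illustration -- any directory that all
executors can reach, typically on HDFS, will do):

    // set this once, before any checkpoint() calls
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")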



-----
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-in-Iterative-Graph-Computation-tp24443p25013.html


Re: Checkpointing in Iterative Graph Computation

Posted by Robineast <Ro...@xense.co.uk>.
You need to checkpoint before you materialize. You'll probably only want to
checkpoint every 100 or so iterations, otherwise the checkpointing will slow
down your application excessively.
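
Applied to the snippet from the original post, that means ordering the calls
like this (assuming the checkpoint directory has already been set):

    val users = sc.parallelize(Array((3L, ("rxin", "student")),
      (7L, ("jgonzal", "postdoc"))))
    users.checkpoint()   // mark for checkpointing first...
    users.count()        // ...the first action then writes the checkpoint

    // now returns true
    assert(users.isCheckpointed)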



-----
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-in-Iterative-Graph-Computation-tp24443p25012.html