
Posted to user@spark.apache.org by dash <bs...@nd.edu> on 2014/06/18 00:29:58 UTC

Best practices for removing lineage of a RDD or Graph object?

If an RDD object has non-empty .dependencies, does that mean it has
lineage? How can I remove it?

I'm doing an iterative computation in which each iteration depends on the
result computed in the previous iteration. After several iterations, it throws
a StackOverflowError.

At first I tried cache. I read the code in Pregel.scala, which is part of
GraphX; it calls count to materialize the object after cache. But when I
attached a debugger, that approach did not seem to empty .dependencies, and it
does not work in my code either.
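
A minimal sketch of the behaviour described above, assuming a local Spark installation: cache() plus an action stores the computed partitions, but the RDD still reports its parent dependencies. The object and method names are illustrative only and do not come from Spark or from the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: checks whether a cached, materialized RDD
// still carries dependencies on its parent.
object CacheLineageSketch {
  def cachedKeepsDependencies(sc: SparkContext): Boolean = {
    val base = sc.parallelize(1 to 10)
    val mapped = base.map(_ + 1).cache()
    mapped.count()                // materializes and caches the partitions
    println(mapped.toDebugString) // prints the lineage as Spark sees it
    mapped.dependencies.nonEmpty  // caching does not drop the parent dependency
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-lineage-sketch").setMaster("local[2]"))
    try println(cachedKeepsDependencies(sc)) finally sc.stop()
  }
}
```

Caching only saves recomputation; it does not shorten the chain of parent RDDs, so on its own it would not be expected to prevent a StackOverflowError from a very long lineage.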

The alternative is checkpoint. I tried checkpointing the vertices and edges of
my Graph object and then materializing them by calling count on both. Then I
used .isCheckpointed to check whether they were correctly checkpointed, but it
always returns false.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-for-removing-lineage-of-a-RDD-or-Graph-object-tp7779.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Best practices for removing lineage of a RDD or Graph object?

Posted by dash <bs...@nd.edu>.

Hi Roy, 

Thanks for your help. I wrote a small code snippet that reproduces the problem.
Could you read through it and see if I did anything wrong?

Thanks!

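  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.graphx.{Edge, Graph, VertexId}

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TEST")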
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "edu.nd.dsg.hdtm.util.HDTMKryoRegistrator")
    val sc = new SparkContext(conf)

    val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
    val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L)))
    val newGraph = Graph(v, e)
    var currentGraph = newGraph
    val vertexIds = currentGraph.vertices.map(_._1).collect()

    for (i <- 1 to 1000) {
      var g = currentGraph
      vertexIds.toStream.foreach(id => {
        g = Graph(currentGraph.vertices, currentGraph.edges)
        g.cache()
        g.edges.cache()
        g.vertices.cache()
        g.vertices.count()
        g.edges.count()
      })

      currentGraph.unpersistVertices(blocking =  false)
      currentGraph.edges.unpersist(blocking = false)
      currentGraph = g
      println(" iter "+i+" finished")
    }

  }
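
One common way to keep lineage bounded in a loop like the one above is to checkpoint every few iterations, flagging the new RDD before its first action. The following is only a sketch on a plain pair RDD rather than a Graph; the object name, the `checkpointEvery` parameter, and the temporary checkpoint directory are all assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PeriodicCheckpointSketch {
  // checkpointEvery is a hypothetical tuning knob, not a Spark setting.
  // Assumes sc.setCheckpointDir(...) has already been called.
  def iterate(sc: SparkContext, iterations: Int, checkpointEvery: Int): RDD[(Long, Long)] = {
    var current: RDD[(Long, Long)] =
      sc.parallelize(Seq((0L, 0L), (1L, 1L), (2L, 2L))).cache()
    current.count()
    for (i <- 1 to iterations) {
      // Each iteration derives a new RDD from the previous one.
      val next = current.map { case (id, value) => (id, value + 1) }.cache()
      // Flag for checkpointing BEFORE the first action on `next`.
      if (i % checkpointEvery == 0) next.checkpoint()
      next.count() // materializes `next`; writes the checkpoint if flagged
      current.unpersist(blocking = false)
      current = next
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("periodic-checkpoint-sketch").setMaster("local[2]"))
    try {
      // Assumption: a local temp directory; on a cluster use a shared filesystem path.
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)
      println(iterate(sc, 100, 10).collect().toSeq)
    } finally sc.stop()
  }
}
```

The interval is a trade-off: checkpointing more often keeps the lineage shorter but writes to storage more frequently.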


Baoxu Shi(Dash)
Computer Science and Engineering Department
University of Notre Dame
bshi@nd.edu



> On Jun 19, 2014, at 1:47 AM, roy20021 [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> Not sure if this helps, but:
> Checkpointing cuts the lineage. The checkpoint method only sets a flag. For the checkpoint to actually be performed, you must NOT materialise the RDD before it has been flagged; otherwise the flag is simply ignored.
> 
> rdd2 = rdd1.map(..)
> rdd2.checkpoint()
> rdd2.count
> rdd2.isCheckpointed // true

Re: Best practices for removing lineage of a RDD or Graph object?

Posted by Andrea Esposito <an...@gmail.com>.

Not sure if this helps, but:
Checkpointing cuts the lineage. The checkpoint method only sets a flag. For
the checkpoint to actually be performed, you must NOT materialise the RDD
before it has been flagged; otherwise the flag is simply ignored.

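val rdd2 = rdd1.map(..)
rdd2.checkpoint()   // only sets the flag; nothing is written yet
rdd2.count          // first action materialises rdd2 and performs the checkpoint
rdd2.isCheckpointed // true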
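
A fuller, hedged version of the snippet above, as a sketch assuming a local Spark installation. It also sets a checkpoint directory with `SparkContext.setCheckpointDir`, which Spark requires before `checkpoint()` can be used. The object name, method name, and temporary directory are illustrative assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointOrderSketch {
  // Returns isCheckpointed (before the first action, after the first action).
  // Assumes sc.setCheckpointDir(...) has already been called.
  def flagThenMaterialize(sc: SparkContext): (Boolean, Boolean) = {
    val rdd1 = sc.parallelize(1 to 100)
    val rdd2 = rdd1.map(_ * 2).cache() // cache so the checkpoint write can reuse the data
    rdd2.checkpoint()                  // flag only; no job has run on rdd2 yet
    val before = rdd2.isCheckpointed
    rdd2.count()                       // first action: computes rdd2 and writes the checkpoint
    val after = rdd2.isCheckpointed
    println(rdd2.toDebugString)        // lineage as reported after checkpointing
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-order-sketch").setMaster("local[2]"))
    try {
      // Assumption: a local temp directory; on a cluster use a shared filesystem path.
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)
      println(flagThenMaterialize(sc))
    } finally sc.stop()
  }
}
```

Note that after a checkpoint, `.dependencies` may still be non-empty because the RDD can report the checkpoint data as its parent, so `isCheckpointed` or `toDebugString` is likely a more useful check than looking for an empty dependency list.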
