Posted to user@spark.apache.org by Raghava Mutharaju <m....@gmail.com> on 2016/01/24 01:48:01 UTC

understanding iterative algorithms in Spark

Hello All,

I am new to Spark and I am trying to understand how iterative application
of operations is handled in Spark. Consider the following program in Scala.

var u = sc.textFile(args(0)+"s1.txt").map(line => {
       line.split("\\|") match { case Array(x,y) => (y.toInt,x.toInt)}})

u.cache()
println("Before iteration u count: "+u.count())

val type1 = sc.textFile(args(0)+"Type1.txt").map(line => {
        line.split("\\|") match { case Array(x,y) => (x.toInt,y.toInt)}})
type1.cache()
println("Type1 count: " + type1.count())

var counter=0
while(counter < 20) {
        val r1Join = type1.join(u).map( { case (k,v) => v}).cache()
        u = u.union(r1Join).distinct.cache()
        // testing checkpoint
        if (counter == 4)
                u.checkpoint()
        println("u count: "+u.count())
        counter += 1
}

From the UI, I have attached the DAG visualizations at various iterations.

I have the following questions. It would be of great help if someone can
answer them.

1) When we cache an RDD, is it safe to say that it will not be recomputed?
For example in dag1.png, all the green map dots will not be recomputed.

2) In dag1.png, for the stage4 join, we expected one input to be the output
of stage3, which is what happens, and the other input to be the output of
stage2, which does not happen. Why is this the case?

3) In dag1.png, why is stage5 not part of stage4? Why are distinct and
distinct + cache separated? Will distinct be run twice?

4) In dag4.png, we expected the input of join in stage21 would come from
the output of stage19 but instead, it gets recomputed at the beginning of
stage21. Why would distinct get recomputed at the beginning of each
iteration?

5) In dag2.png, join operation is represented by 3 boxes. What does this
mean?

6) In dag4.png, there are several "skipped" stages. Is it safe to assume
that the skipped stages are not recomputed?
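
In case it helps anyone reproduce this without the input files, here is a
minimal sketch of one iteration of the loop that prints the storage level,
checkpoint status and lineage of u instead of relying on the UI. The object
name LineageCheck, the local[2] master, the checkpoint path and the small
in-memory data sets are assumptions made for illustration; they are not part
of the original program.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-alone driver; names and paths are illustrative only.
object LineageCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("LineageCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // checkpoint() needs a checkpoint directory; this path is an assumption.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Small in-memory stand-ins for s1.txt and Type1.txt.
    var u: RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (2, 3))).cache()
    val type1: RDD[(Int, Int)] = sc.parallelize(Seq((1, 10), (2, 20))).cache()

    // One iteration of the loop from the program above.
    val r1Join = type1.join(u).map { case (k, v) => v }.cache()
    u = u.union(r1Join).distinct().cache()

    // checkpoint() only marks the RDD; the data is written by the next action.
    u.checkpoint()
    println("u count: " + u.count())

    // Inspect what Spark recorded for u after the action has run.
    println("storage level: " + u.getStorageLevel)
    println("checkpointed:  " + u.isCheckpointed)
    println(u.toDebugString)

    sc.stop()
  }
}
```

The toDebugString output lists the chain of parent RDDs, so it can be
compared against the stages shown in the DAG visualizations.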

Thanks in advance.

--
Regards,
Raghava