Posted to user@spark.apache.org by Abarah <se...@yahoo.com> on 2015/07/15 05:30:08 UTC

Is using a reference for an RDD safe?

Hello, I am wondering what will happen if I use a reference when
transforming an RDD, for example:

import org.apache.spark.rdd.RDD

def func1(rdd: RDD[Int]): RDD[Int] = {
  rdd.map(x => x * 2) // example transformation; my real function is more complex
}

def main(): Unit = {
  .....
  val myrdd = sc.parallelize(1 to 1000000)
  val myrdd2 = func1(myrdd)
  myrdd2.count()
}

The above is just an example. I am wondering whether it is safe to pass an
RDD by reference like this. Thank you.





Re: Is using a reference for an RDD safe?

Posted by Mina <se...@yahoo.com>.
Hi, thank you for your answer, but I was talking about a function
reference. I want to transform an RDD using a function made up of multiple
transformations.
For example:

def transformFunc1(rdd: RDD[Int]): RDD[Int] = {
  // a function made up of multiple transformations, e.g.:
  rdd.filter(_ % 2 == 0).map(_ * 2)
}

val rdd2 = transformFunc1(rdd1)...
Here I am using a reference, I think, but I am not sure.






Re: Is using a reference for an RDD safe?

Posted by Gylfi <gy...@berkeley.edu>.
Hi. 

"All transformations in Spark are lazy, in that they do not compute their
results right away. Instead, they just remember the transformations applied
to some base dataset (e.g. a file). The transformations are only computed
when an action requires a result to be returned to the driver program. This
design enables Spark to run more efficiently – for example, we can realize
that a dataset created through map will be used in a reduce and return only
the result of the reduce to the driver, rather than the larger mapped
dataset."
See section "RDD Operations" in
https://spark.apache.org/docs/1.2.0/programming-guide.html

Thus, neither myrdd nor myrdd2 will exist until you call the count.
What is stored is just "how to create myrdd and myrdd2", so yes, this is
safe.
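
You can see the recorded lineage directly. A minimal sketch, assuming a
live SparkContext sc and the func1 from the original post:

val myrdd = sc.parallelize(1 to 1000000)
val myrdd2 = func1(myrdd) // no job has run yet; only the lineage is recorded

// toDebugString prints the chain of transformations Spark has remembered
println(myrdd2.toDebugString)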

When you run myrdd2.count, both RDDs are materialized, myrdd2 is counted,
and the count is returned to the driver.
After the operation, both RDDs are "destroyed" again.
If you run myrdd2.count again, both myrdd and myrdd2 are recomputed from
scratch.

If your transformation is expensive, you may want to keep the data around;
for that you must use .persist() or .cache() etc.
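
For example, a minimal sketch reusing myrdd2 from above:

val cached = myrdd2.cache() // or .persist() with an explicit storage level
cached.count() // first action: computes the RDD and caches its partitions
cached.count() // second action: served from the cache, no recomputation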

Regards,
   Gylfi. 



