You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Simone Franzini <ca...@gmail.com> on 2014/11/14 17:31:31 UTC
Declaring multiple RDDs and efficiency concerns
Let's say I have to apply a complex sequence of operations to a certain RDD.
In order to make code more modular/readable, I would typically have
something like this:
object myObject {
def main(args: Array[String]) {
val rdd1 = function1(myRdd)
val rdd2 = function2(rdd1)
val rdd3 = function3(rdd2)
}
def function1(rdd: RDD) : RDD = { doSomething }
def function2(rdd: RDD) : RDD = { doSomethingElse }
def function3(rdd: RDD) : RDD = { doSomethingElseYet }
}
So I am explicitly declaring vals for the intermediate steps. Does this end
up using more storage than if I just chained all of the operations and
declared only one val instead?
If yes, is there a better way to chain together the operations?
Ideally I would like to do something like:
val rdd = function1.function2.function3
Is there a way I can write the signature of my functions to accomplish
this? Is this also an efficiency issue or just a stylistic one?
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini
Re: Declaring multiple RDDs and efficiency concerns
Posted by Sean Owen <so...@cloudera.com>.
This code executes on the driver, and an "RDD" here is really just a
handle on all the distributed data out there. It's a local bookkeeping
object. So, manipulation of these objects themselves in the local
driver code has virtually no performance impact. These two versions
would be about identical*.
* maybe someone can point out a case where not maintaining the
reference lets something get cleaned up earlier, but I'm not aware of
this sort of effect
On Fri, Nov 14, 2014 at 4:31 PM, Simone Franzini <ca...@gmail.com> wrote:
> Let's say I have to apply a complex sequence of operations to a certain RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
> def main(args: Array[String]) {
> val rdd1 = function1(myRdd)
> val rdd2 = function2(rdd1)
> val rdd3 = function3(rdd2)
> }
>
> def function1(rdd: RDD) : RDD = { doSomething }
> def function2(rdd: RDD) : RDD = { doSomethingElse }
> def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this end
> up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish this?
> Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Declaring multiple RDDs and efficiency concerns
Posted by Rishi Yadav <ri...@infoobjects.com>.
how about using fluent style of Scala programming.
On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini <ca...@gmail.com>
wrote:
> Let's say I have to apply a complex sequence of operations to a certain
> RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
> def main(args: Array[String]) {
> val rdd1 = function1(myRdd)
> val rdd2 = function2(rdd1)
> val rdd3 = function3(rdd2)
> }
>
> def function1(rdd: RDD) : RDD = { doSomething }
> def function2(rdd: RDD) : RDD = { doSomethingElse }
> def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this
> end up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish
> this? Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>