You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Simone Franzini <ca...@gmail.com> on 2014/11/14 17:31:31 UTC

Declaring multiple RDDs and efficiency concerns

Let's say I have to apply a complex sequence of operations to a certain RDD.
In order to make code more modular/readable, I would typically have
something like this:

object myObject {
  def main(args: Array[String]) {
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 = function3(rdd2)
  }

  def function1(rdd: RDD) : RDD = { doSomething }
  def function2(rdd: RDD) : RDD = { doSomethingElse }
  def function3(rdd: RDD) : RDD = { doSomethingElseYet }
}

So I am explicitly declaring vals for the intermediate steps. Does this end
up using more storage than if I just chained all of the operations and
declared only one val instead?
If yes, is there a better way to chain together the operations?
Ideally I would like to do something like:

val rdd = function1.function2.function3

Is there a way I can write the signature of my functions to accomplish
this? Is this also an efficiency issue or just a stylistic one?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

Re: Declaring multiple RDDs and efficiency concerns

Posted by Sean Owen <so...@cloudera.com>.
This code executes on the driver, and an "RDD" here is really just a
handle on all the distributed data out there. It's a local bookkeeping
object. So, manipulation of these objects themselves in the local
driver code has virtually no performance impact. These two versions
would be about identical*.

* maybe someone can point out a case where not maintaining the
reference lets something get cleaned up earlier, but I'm not aware of
this sort of effect

On Fri, Nov 14, 2014 at 4:31 PM, Simone Franzini <ca...@gmail.com> wrote:
> Let's say I have to apply a complex sequence of operations to a certain RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
>   def main(args: Array[String]) {
>     val rdd1 = function1(myRdd)
>     val rdd2 = function2(rdd1)
>     val rdd3 = function3(rdd2)
>   }
>
>   def function1(rdd: RDD) : RDD = { doSomething }
>   def function2(rdd: RDD) : RDD = { doSomethingElse }
>   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this end
> up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish this?
> Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Declaring multiple RDDs and efficiency concerns

Posted by Rishi Yadav <ri...@infoobjects.com>.
how about using fluent style of Scala programming.


On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini <ca...@gmail.com>
wrote:

> Let's say I have to apply a complex sequence of operations to a certain
> RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
>   def main(args: Array[String]) {
>     val rdd1 = function1(myRdd)
>     val rdd2 = function2(rdd1)
>     val rdd3 = function3(rdd2)
>   }
>
>   def function1(rdd: RDD) : RDD = { doSomething }
>   def function2(rdd: RDD) : RDD = { doSomethingElse }
>   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this
> end up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish
> this? Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>