Posted to user@spark.apache.org by Ashish Tadose <as...@gmail.com> on 2016/09/08 17:35:24 UTC

Returning DataFrame as Scala method return type

Hi Team,

I have a Spark job with a large number of DataFrame operations.

This job reads various lookup data from external tables such as MySQL, and
also runs many DataFrame operations on large data stored on HDFS in Parquet
format.

The job works fine on the cluster; however, the driver code looks clumsy
because of the large number of operations written in the driver method.

I wish to organize these DataFrame operations by grouping them into Scala
object methods, something like below:



> object Driver {
>   def main(args: Array[String]): Unit = {
>     val df = Operations.process(sparkContext)
>   }
> }
>
> object Operations {
>   def process(sparkContext: SparkContext): DataFrame = {
>     // series of DataFrame operations
>   }
> }


My question is: would returning a DataFrame from another Scala object's
method be the right thing to do at large scale? Would returning a DataFrame
to the driver cause all the data to be passed to the driver code, or would
it return just a pointer to the DataFrame?


Thanks,
Ashish

Re: Returning DataFrame as Scala method return type

Posted by Jakob Odersky <ja...@odersky.com>.
(Maybe unrelated FYI): in case you're using only Scala or Java with
Spark, I would recommend using Datasets instead of DataFrames. They
provide exactly the same functionality, yet offer more type safety.

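Jakob's type-safety point can be sketched without a Spark cluster: a Dataset
pipeline is checked against a case class at compile time, while DataFrame
columns are resolved by name only at runtime. The following is a minimal,
non-Spark analogy in plain Scala (the Person type and both helper functions
are illustrative, not Spark API):

```scala
// "Typed" pipeline: the compiler checks every field access on the case class.
final case class Person(name: String, age: Int)

def typedAdults(people: Seq[Person]): Seq[Person] =
  people.filter(_.age >= 18) // a typo like _.agee would fail to compile

// "Untyped" pipeline: columns are looked up by string name at runtime,
// so a misspelled column only fails when the job actually runs.
def untypedAdults(rows: Seq[Map[String, Any]]): Seq[Map[String, Any]] =
  rows.filter(_("age").asInstanceOf[Int] >= 18)
```

With a real Dataset[Person], a typo such as `_.agee` is a compile error; the
equivalent DataFrame expression with a misspelled column name only fails once
the query is analyzed or executed.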
On Thu, Sep 8, 2016 at 11:05 AM, Lee Becker <le...@hapara.com> wrote:
>
> As long as the methods do not trigger any executions, it is fine to pass a
> DataFrame back to the driver. Think of a DataFrame as an abstraction over
> RDDs. When you return an RDD or DataFrame you're not returning the object
> itself. Instead you're returning a recipe that details the series of
> operations needed to produce the data.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Returning DataFrame as Scala method return type

Posted by Lee Becker <le...@hapara.com>.
On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose <as...@gmail.com>
wrote:

> I wish to organize these DataFrame operations by grouping them into Scala
> object methods.
>
> My question is: would returning a DataFrame from another Scala object's
> method be the right thing to do at large scale? Would returning a DataFrame
> to the driver cause all the data to be passed to the driver code, or would
> it return just a pointer to the DataFrame?
>

As long as the methods do not trigger any executions, it is fine to pass a
DataFrame back to the driver.  Think of a DataFrame as an abstraction over
RDDs.  When you return an RDD or DataFrame you're not returning the object
itself.  Instead you're returning a recipe that details the series of
operations needed to produce the data.
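
Lee's "recipe" description can be illustrated without Spark at all. The sketch
below is not Spark code (Plan, transform, and collect are invented names): it
builds up a list of operations lazily, so a method can return the plan to the
driver without moving any data, and work only happens when collect() is called:

```scala
// Minimal lazy "recipe": each transform only appends a function to the plan.
final case class Plan[A](source: () => Vector[A], ops: List[Vector[A] => Vector[A]]) {
  def transform(f: Vector[A] => Vector[A]): Plan[A] = copy(ops = ops :+ f)
  // Execution happens only here, analogous to a Spark action like collect().
  def collect(): Vector[A] = ops.foldLeft(source())((data, f) => f(data))
}

object Operations {
  // Returns a description of the work, not the data itself.
  def process(): Plan[Int] =
    Plan(() => Vector.range(1, 1000001), Nil)
      .transform(_.map(_ * 2))
      .transform(_.filter(_ % 3 == 0))
}
```

Returning the Plan from Operations.process is cheap: it holds two functions,
not a million rows. Spark DataFrames behave analogously, carrying a logical
plan that is only materialized on the executors when an action runs.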