Posted to user@spark.apache.org by Martin Serrano <ma...@attivio.com> on 2016/06/24 14:27:53 UTC

DataFrame versus Dataset creation and usage

Hi,

I'm exposing a custom source to the Spark environment.  I have a question about the best way to approach this problem.

I created a custom relation for my source, and it produces a DataFrame<Row>.  My custom source knows the data types, which are dynamic, so this seemed to be the appropriate return type.  This works fine.
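
(For context, a stripped-down sketch of this kind of relation -- not my actual source; the class name and constructor arguments are made up -- against the 1.6 Java API:)

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.TableScan;
import org.apache.spark.sql.types.StructType;

// Hypothetical relation: both the schema and the rows are discovered at
// runtime from the custom source, so nothing is known at compile time.
public class DynamicSourceRelation extends BaseRelation implements TableScan {
  private final SQLContext sqlContext;
  private final StructType schema;   // built from the source's metadata
  private final JavaRDD<Row> rows;   // rows matching that schema

  public DynamicSourceRelation(SQLContext sqlContext, StructType schema, JavaRDD<Row> rows) {
    this.sqlContext = sqlContext;
    this.schema = schema;
    this.rows = rows;
  }

  @Override public SQLContext sqlContext() { return sqlContext; }
  @Override public StructType schema() { return schema; }
  @Override public RDD<Row> buildScan() { return rows.rdd(); }
}

The DataFrame<Row> would then come from something like sqlContext.baseRelationToDataFrame(relation), or from a RelationProvider via sqlContext.read().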

The next step I want to take is to expose some custom mapping functions (written in Java).  But when I look at the APIs, the map method for DataFrame returns an RDD (not a DataFrame).  (Should I use SQLContext.createDataFrame on the result? -- does this result in additional processing overhead?)  The Dataset type seems to be more of what I'm looking for: its map method returns a Dataset, so chaining transformations together is a natural exercise.
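
(To make that concrete, the pattern I have in mind is roughly the following -- just a sketch; transformRow stands in for one of my Java mappers, and df/sqlContext come from the custom relation above:)

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Drop to the RDD level, apply the custom mapper, then rebuild a DataFrame.
JavaRDD<Row> mapped = df.javaRDD().map(new Function<Row, Row>() {
  @Override public Row call(Row row) throws Exception {
    return transformRow(row);   // transformRow is a stand-in for a custom Java mapper
  }
});

// Reuse the incoming schema if the mapper keeps the same fields, or build a
// new StructType here if it adds or changes fields.
StructType schema = df.schema();
DataFrame mappedDf = sqlContext.createDataFrame(mapped, schema);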

But to create a Dataset from a DataFrame, it appears that I have to provide the type of each field in the Row to the DataFrame.as[...] method.  I would have thought the DataFrame could do this automatically, since it already has all the types.
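
(For example, as far as I can tell the Java route needs an explicit encoder, which only works when the element type is a compile-time class -- MyRecord below is hypothetical, and exactly the kind of class I don't have:)

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Works only when the element type is statically known (e.g. a bean class);
// with a purely dynamic schema there is no such class to hand the encoder.
Dataset<MyRecord> typed = df.as(Encoders.bean(MyRecord.class));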

This leads me to wonder how I should be approaching this effort.  As all the fields and types are dynamic, I cannot use beans as my type when passing data around.  Any advice would be appreciated.

Thanks,
Martin




Re: DataFrame versus Dataset creation and usage

Posted by Martin Serrano <ma...@attivio.com>.
Xinh,

Thanks for the clarification.  I'm new to Spark and trying to navigate the different APIs.  I was just following some examples and retrofitting them, but I see now I should stick with plain RDDs until my schema is known (at the end of the data pipeline).

Thanks again!

On 06/24/2016 04:57 PM, Xinh Huynh wrote:
Hi Martin,

Since your schema is dynamic, how would you use Datasets? Would you know ahead of time the row type T in a Dataset[T]?

One option is to start with DataFrames in the beginning of your data pipeline, figure out the field types, and then switch completely over to RDDs or Dataset in the next stage of the pipeline.

Also, I'm not sure what the custom Java mappers are doing - could you use them as UDFs within a DataFrame?

Xinh

On Fri, Jun 24, 2016 at 11:42 AM, Martin Serrano <ma...@attivio.com> wrote:
Indeed.  But I'm dealing with 1.6 for now unfortunately.


On 06/24/2016 02:30 PM, Ted Yu wrote:
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case?

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <ma...@attivio.com> wrote:
Hi,

I'm exposing a custom source to the Spark environment.  I have a question about the best way to approach this problem.

I created a custom relation for my source, and it produces a DataFrame<Row>.  My custom source knows the data types, which are dynamic, so this seemed to be the appropriate return type.  This works fine.

The next step I want to take is to expose some custom mapping functions (written in Java).  But when I look at the APIs, the map method for DataFrame returns an RDD (not a DataFrame).  (Should I use SQLContext.createDataFrame on the result? -- does this result in additional processing overhead?)  The Dataset type seems to be more of what I'm looking for: its map method returns a Dataset, so chaining transformations together is a natural exercise.

But to create a Dataset from a DataFrame, it appears that I have to provide the type of each field in the Row to the DataFrame.as[...] method.  I would have thought the DataFrame could do this automatically, since it already has all the types.

This leads me to wonder how I should be approaching this effort.  As all the fields and types are dynamic, I cannot use beans as my type when passing data around.  Any advice would be appreciated.

Thanks,
Martin








Re: DataFrame versus Dataset creation and usage

Posted by Xinh Huynh <xi...@gmail.com>.
Hi Martin,

Since your schema is dynamic, how would you use Datasets? Would you know
ahead of time the row type T in a Dataset[T]?

One option is to start with DataFrames in the beginning of your data
pipeline, figure out the field types, and then switch completely over to
RDDs or Dataset in the next stage of the pipeline.
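
For example (just a sketch; "df" stands for the DataFrame your custom relation returns):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Inspect the schema the relation reported, then continue with plain rows.
StructType schema = df.schema();
for (StructField field : schema.fields()) {
  System.out.println(field.name() + " : " + field.dataType());
}
JavaRDD<Row> rows = df.javaRDD();   // downstream stages can work on untyped rows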

Also, I'm not sure what the custom Java mappers are doing - could you use
them as UDFs within a DataFrame?
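
For example, something like this (a rough sketch against the 1.6 Java API; the UDF name, column names, and logic are made up):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Wrap an existing single-argument Java function as a UDF and apply it column-wise.
sqlContext.udf().register("normalize", new UDF1<String, String>() {
  @Override public String call(String value) {
    return value == null ? null : value.trim().toLowerCase();
  }
}, DataTypes.StringType);

DataFrame result = df.withColumn("title_norm", callUDF("normalize", col("title")));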

Xinh

On Fri, Jun 24, 2016 at 11:42 AM, Martin Serrano <ma...@attivio.com> wrote:

> Indeed.  But I'm dealing with 1.6 for now unfortunately.
>
>
> On 06/24/2016 02:30 PM, Ted Yu wrote:
>
> In Spark 2.0, Dataset and DataFrame are unified.
>
> Would this simplify your use case?
>
> On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <ma...@attivio.com>
> wrote:
>
>> Hi,
>>
>> I'm exposing a custom source to the Spark environment.  I have a question
>> about the best way to approach this problem.
>>
>> I created a custom relation for my source, and it produces a
>> DataFrame<Row>.  My custom source knows the data types, which are
>> *dynamic*, so this seemed to be the appropriate return type.  This works
>> fine.
>>
>> The next step I want to take is to expose some custom mapping functions
>> (written in Java).  But when I look at the APIs, the map method for
>> DataFrame returns an RDD (not a DataFrame).  (Should I use
>> SQLContext.createDataFrame on the result? -- does this result in additional
>> processing overhead?)  The Dataset type seems to be more of what I'm
>> looking for: its map method returns a Dataset, so chaining transformations
>> together is a natural exercise.
>>
>> But to create a Dataset from a DataFrame, it appears that I have to
>> provide the type of each field in the Row to the DataFrame.as[...]
>> method.  I would have thought the DataFrame could do this
>> automatically, since it already has all the types.
>>
>> This leads me to wonder how I should be approaching this effort.  As all
>> the fields and types are dynamic, I cannot use beans as my type when
>> passing data around.  Any advice would be appreciated.
>>
>> Thanks,
>> Martin
>>
>>
>>
>>
>
>

Re: DataFrame versus Dataset creation and usage

Posted by Martin Serrano <ma...@attivio.com>.
Indeed.  But I'm dealing with 1.6 for now unfortunately.

On 06/24/2016 02:30 PM, Ted Yu wrote:
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case?

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <ma...@attivio.com> wrote:
Hi,

I'm exposing a custom source to the Spark environment.  I have a question about the best way to approach this problem.

I created a custom relation for my source, and it produces a DataFrame<Row>.  My custom source knows the data types, which are dynamic, so this seemed to be the appropriate return type.  This works fine.

The next step I want to take is to expose some custom mapping functions (written in Java).  But when I look at the APIs, the map method for DataFrame returns an RDD (not a DataFrame).  (Should I use SQLContext.createDataFrame on the result? -- does this result in additional processing overhead?)  The Dataset type seems to be more of what I'm looking for: its map method returns a Dataset, so chaining transformations together is a natural exercise.

But to create a Dataset from a DataFrame, it appears that I have to provide the type of each field in the Row to the DataFrame.as[...] method.  I would have thought the DataFrame could do this automatically, since it already has all the types.

This leads me to wonder how I should be approaching this effort.  As all the fields and types are dynamic, I cannot use beans as my type when passing data around.  Any advice would be appreciated.

Thanks,
Martin






Re: DataFrame versus Dataset creation and usage

Posted by Ted Yu <yu...@gmail.com>.
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case?
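
For example, in the 2.0 Java API map() stays within Dataset -- a rough sketch, where RowEncoder is technically an internal class but the usual way to build a Row encoder from a runtime schema:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

// In 2.0 a DataFrame is just Dataset<Row>, so map() yields another Dataset<Row>.
Dataset<Row> mapped = df.map(new MapFunction<Row, Row>() {
  @Override public Row call(Row row) {
    return row;   // stand-in for a real transformation
  }
}, RowEncoder.apply(df.schema()));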

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <ma...@attivio.com> wrote:

> Hi,
>
> I'm exposing a custom source to the Spark environment.  I have a question
> about the best way to approach this problem.
>
> I created a custom relation for my source, and it produces a
> DataFrame<Row>.  My custom source knows the data types, which are *dynamic*,
> so this seemed to be the appropriate return type.  This works fine.
>
> The next step I want to take is to expose some custom mapping functions
> (written in Java).  But when I look at the APIs, the map method for
> DataFrame returns an RDD (not a DataFrame).  (Should I use
> SQLContext.createDataFrame on the result? -- does this result in additional
> processing overhead?)  The Dataset type seems to be more of what I'm
> looking for: its map method returns a Dataset, so chaining transformations
> together is a natural exercise.
>
> But to create a Dataset from a DataFrame, it appears that I have to
> provide the type of each field in the Row to the DataFrame.as[...]
> method.  I would have thought the DataFrame could do this
> automatically, since it already has all the types.
>
> This leads me to wonder how I should be approaching this effort.  As all
> the fields and types are dynamic, I cannot use beans as my type when
> passing data around.  Any advice would be appreciated.
>
> Thanks,
> Martin
>
>
>
>