Posted to user@spark.apache.org by Arun Patel <ar...@gmail.com> on 2016/06/13 11:01:15 UTC

Spark 2.0: Unify DataFrames and Datasets question

In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply an
alias for a Dataset of type Row.  I have a few questions.

1) What does this really mean to an application developer?
2) Why was this unification needed in Spark 2.0?
3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
4) Will compile-time safety be there for DataFrames too?
5) Is the Python API supported for Datasets in 2.0?

Thanks
Arun

Re: Spark 2.0: Unify DataFrames and Datasets question

Posted by Xinh Huynh <xi...@gmail.com>.
Hi Arun,

This documentation may be helpful:

The 2.0-preview Scaladoc for the Dataset class:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Dataset
Note that the Dataset API has completely changed from 1.6.

In 2.0, there is no separate DataFrame class. Rather, it is a type alias
defined here:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.package@DataFrame=org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
"type DataFrame = Dataset
<http://spark.apache.org/docs/2.0.0-preview/api/scala/org/apache/spark/sql/Dataset.html>
[Row
<http://spark.apache.org/docs/2.0.0-preview/api/scala/org/apache/spark/sql/Row.html>
]"
Unlike in 1.6, a DataFrame is a specific Dataset[T], where T=Row, so
DataFrame shares the same methods as Dataset.
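
To make the alias concrete, here is a minimal Spark 2.0 sketch (the
Person case class and every name in it are hypothetical, for
illustration only):

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    object UnifiedApiSketch {
      case class Person(name: String, age: Long)  // hypothetical schema

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        // A typed Dataset[Person].
        val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

        // DataFrame is only a type alias for Dataset[Row], so these two
        // values have exactly the same type.
        val df: DataFrame = ds.toDF()
        val sameThing: Dataset[Row] = df

        // Both expose the same Dataset methods.
        ds.filter(_.age > 26).show()   // typed: the compiler checks the field
        df.filter($"age" > 26).show()  // untyped: column resolved at runtime

        spark.stop()
      }
    }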

As mentioned earlier, this unification is only available in Scala and Java.

Xinh

On Tue, Jun 14, 2016 at 10:45 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> 1) What does this really mean to an application developer?
>>
>
> It means there are fewer concepts to learn.
>
>
>> 2) Why was this unification needed in Spark 2.0?
>>
>
> To simplify the API and reduce the number of concepts that need to be
> learned.  The only reason we didn't do it in 1.6 was that we didn't want
> to break binary compatibility in a minor release.
>
>
>> 3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
>>
>
> There is no DataFrame class; all methods are still available, except those
> that returned an RDD (now you can call df.rdd.map if that is still what
> you want).
>
>
>> 4) Will compile-time safety be there for DataFrames too?
>>
>
> Slide 7
>
>
>> 5) Is the Python API supported for Datasets in 2.0?
>>
>
> Slide 10
>

Re: Spark 2.0: Unify DataFrames and Datasets question

Posted by Michael Armbrust <mi...@databricks.com>.
>
> 1) What does this really mean to an application developer?
>

It means there are fewer concepts to learn.


> 2) Why was this unification needed in Spark 2.0?
>

To simplify the API and reduce the number of concepts that need to be
learned.  The only reason we didn't do it in 1.6 was that we didn't want
to break binary compatibility in a minor release.


> 3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
>

There is no DataFrame class; all methods are still available, except those
that returned an RDD (now you can call df.rdd.map if that is still what
you want).
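
For instance, a sketch (assuming a DataFrame df with a "name" column and
spark.implicits._ in scope; all names here are illustrative):

    // 1.6: DataFrame.map(...) returned an RDD.  2.0: map on a DataFrame
    // returns a Dataset, so go through .rdd explicitly if an RDD is
    // still what you want.
    val namesRdd: org.apache.spark.rdd.RDD[String] =
      df.rdd.map(row => row.getAs[String]("name"))  // the 1.6-style result

    val namesDs: org.apache.spark.sql.Dataset[String] =
      df.map(row => row.getAs[String]("name"))      // 2.0: a Dataset[String]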


> 4) Will compile-time safety be there for DataFrames too?
>

Slide 7
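
In short: a typed Dataset is checked at compile time, while a DataFrame,
being a Dataset[Row], refers to columns by string name and fails only at
runtime.  A sketch with hypothetical names, assuming case class
Person(name: String, age: Long), ds: Dataset[Person], df = ds.toDF(),
and spark.implicits._ in scope:

    // Typed Dataset: the compiler checks field references.
    ds.map(p => p.name)          // compiles; Person has a name field
    // ds.map(p => p.fullName)   // would not compile: no such field

    // DataFrame (Dataset[Row]): columns are plain strings, so the same
    // mistake compiles and only fails at runtime with an AnalysisException.
    df.select("name")
    df.select("fullName")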


> 5) Is the Python API supported for Datasets in 2.0?
>

Slide 10

Re: Spark 2.0: Unify DataFrames and Datasets question

Posted by Arun Patel <ar...@gmail.com>.
Can anyone answer these questions, please?



On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel <ar...@gmail.com> wrote:

> Thanks Michael.
>
> I went through these slides already and could not find answers to these
> specific questions.
>
> I created a Dataset and converted it to a DataFrame in both 1.6 and 2.0,
> and I don't see any difference between the two versions.  That is why I
> got confused and asked these questions about the unification.
>
> I would appreciate it if you could answer these specific questions.  Thank
> you very much!
>
> On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Here's a talk I gave on the topic:
>>
>> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>>
>> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>>
>> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <ar...@gmail.com>
>> wrote:
>>
>>> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply
>>> an alias for a Dataset of type Row.  I have a few questions.
>>>
>>> 1) What does this really mean to an application developer?
>>> 2) Why was this unification needed in Spark 2.0?
>>> 3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
>>> 4) Will compile-time safety be there for DataFrames too?
>>> 5) Is the Python API supported for Datasets in 2.0?
>>>
>>> Thanks
>>> Arun
>>>
>>
>>
>

Re: Spark 2.0: Unify DataFrames and Datasets question

Posted by Arun Patel <ar...@gmail.com>.
Thanks Michael.

I went through these slides already and could not find answers to these
specific questions.

I created a Dataset and converted it to a DataFrame in both 1.6 and 2.0,
and I don't see any difference between the two versions.  That is why I
got confused and asked these questions about the unification.

I would appreciate it if you could answer these specific questions.  Thank you very much!

On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Here's a talk I gave on the topic:
>
> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>
> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>
> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <ar...@gmail.com>
> wrote:
>
>> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply an
>> alias for a Dataset of type Row.  I have a few questions.
>>
>> 1) What does this really mean to an application developer?
>> 2) Why was this unification needed in Spark 2.0?
>> 3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
>> 4) Will compile-time safety be there for DataFrames too?
>> 5) Is the Python API supported for Datasets in 2.0?
>>
>> Thanks
>> Arun
>>
>
>

Re: Spark 2.0: Unify DataFrames and Datasets question

Posted by Michael Armbrust <mi...@databricks.com>.
Here's a talk I gave on the topic:

https://www.youtube.com/watch?v=i7l3JQRx7Qw
http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust

On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <ar...@gmail.com> wrote:

> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply an
> alias for a Dataset of type Row.  I have a few questions.
>
> 1) What does this really mean to an application developer?
> 2) Why was this unification needed in Spark 2.0?
> 3) What changes can be observed in Spark 2.0 vs. Spark 1.6?
> 4) Will compile-time safety be there for DataFrames too?
> 5) Is the Python API supported for Datasets in 2.0?
>
> Thanks
> Arun
>