
Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2016/09/01 14:17:09 UTC

Difference between Data set and Data Frame in Spark 2

Hi,
What are the practical differences between the new Dataset in Spark 2 and the existing DataFrame?
Has Dataset replaced DataFrame, and what advantages do I get if I use Dataset instead of DataFrame?
Thanks

Re: Difference between Data set and Data Frame in Spark 2

Posted by Mich Talebzadeh <mi...@gmail.com>.

Yes, I tested that. It sounds like RDD is faster.

Having said that, I think Datasets have advantages over RDDs.

Will RDDs be phased out?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Difference between Data set and Data Frame in Spark 2

Posted by Maciej Bryński <ma...@brynski.pl>.

I think there could be a performance reason:
RDDs can be faster than Datasets.

For example, check the query plan for this code:
spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

It contains two serialize/deserialize pairs.

Then compare it with the RDD equivalent:
sc.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()
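
One way to look at the two plans side by side is sketched below. It is script-style code of the kind you would paste into spark-shell; the app name and the local master are assumptions for illustration, and the operator names mentioned in the comments are what Spark 2.x typically prints, not a guarantee for every version.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session inside spark-shell; otherwise starts a local one.
// The master and app name are illustrative assumptions.
val spark = SparkSession.builder().master("local[*]").appName("ds-vs-rdd").getOrCreate()
import spark.implicits._

// Dataset pipeline from the message above, without the final collect().
val ds = spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2)

// Prints the physical plan. Typed lambdas usually show up wrapped in
// DeserializeToObject / SerializeFromObject nodes.
ds.explain()

// RDD version: plain JVM closures, no encoders involved.
val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2)

// Prints the RDD lineage instead of a query plan.
println(rdd.toDebugString)
```

Note that spark.range(100) produces 0 to 99 while 1 to 100 starts at 1, so the two pipelines return slightly different results. That does not affect the plan comparison.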

Regards,
M




-- 
Maciek Bryński

Re: Difference between Data set and Data Frame in Spark 2

Posted by Sean Owen <so...@cloudera.com>.

On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> DataFrame is built on top of RDD to provide the tabular format that we all love,
> making the underlying data easily usable (say, SQL-like queries, column
> headings etc.). The drawback is that it restricts what you can do with a
> DataFrame (now that you have done RDD.toDF).

DataFrame is a Dataset[Row], literally, rather than based on an RDD.

> Dataset is the new RDD, with improvements on RDD. As I understand from
> Sean's explanation, it adds some optimisation on top of the common RDD.

At the moment I don't think there's any particular reason to use RDDs
except to interoperate with code that uses RDDs -- which is entirely
valid. I believe new code would generally touch only Dataset and
DataFrame otherwise. So I don't think there are really 3 elemental
concepts in play as of Spark 2.x.
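
The two points above (DataFrame being Dataset[Row], and RDDs mattering mainly for interop) can be checked with a small sketch like the following. It is spark-shell style code; the sample rows, column names and app name are hypothetical and only there for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Reuses the running session inside spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("df-is-dataset-row").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Compiles with no conversion: in the Scala API, DataFrame is a type alias
// for Dataset[Row].
val asDataset: Dataset[Row] = df

// Interop with RDD-based code: drop down to an RDD of Rows...
val rows: RDD[Row] = df.rdd

// ...and come back from an RDD of tuples with toDF.
val roundTrip: DataFrame = rows.map(r => (r.getString(0), r.getInt(1))).toDF("key", "value")
roundTrip.show()
```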

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: Difference between Data set and Data Frame in Spark 2

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

This is my understanding of these three:

An RDD is the basic construct for spreading data across the nodes, in any
form and any shape: structured, unstructured etc. It is the building block
of Spark, if I may call it that.

DataFrame is built on top of RDD to provide the tabular format that we all love,
making the underlying data easily usable (say, SQL-like queries, column
headings etc.). The drawback is that it restricts what you can do with a
DataFrame (now that you have done RDD.toDF).

Dataset is the new RDD, with improvements on RDD. As I understand from
Sean's explanation, it adds some optimisation on top of the common RDD.

I guess DataFrames are as before.

Please correct me if I am wrong.

Cheers

Dr Mich Talebzadeh




Re: Difference between Data set and Data Frame in Spark 2

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.

Thank you! 
The talk is indeed very good.

Best,
Ovidiu
> On 01 Sep 2016, at 16:47, Jules Damji <ju...@databricks.com> wrote:
> 
> Sean succinctly captured the nuanced differences and the evolution of Datasets. Simply put, structure, to some extent, limits you, and structure is what DataFrames and Datasets, among other things, offer.
> 
> When you want low-level control, or are dealing with unstructured data such as blobs of text or images, RDDs make sense.
> 
> There's an illuminating talk by Michael Armbrust, "Structuring Spark: DataFrames & Datasets", where he makes an eloquent case for their merits and motivation, while also elaborating on RDDs.
> 
> https://youtu.be/1a4pgYzeFwE
> 
> Cheers 
> 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)

Re: Difference between Data set and Data Frame in Spark 2

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.

Thank you, I like and agree with your point. RDDs evolved into Datasets by means of an optimizer.
I just wonder what the use cases for RDDs are (other than the current version of GraphX, which leverages RDDs)?

Best,
Ovidiu


Re: Difference between Data set and Data Frame in Spark 2

Posted by Sean Owen <so...@cloudera.com>.

Here's my paraphrase:

Datasets are really the new RDDs. They have a similar nature
(container of strongly-typed objects) but bring some optimizations via
Encoders for common types.

DataFrames are different from RDDs and Datasets and do not replace and
are not replaced by them. They're fundamentally for tabular data, not
arbitrary objects, and thus support SQL-like operations that only
make sense on tabular data.
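
To make the paraphrase concrete, here is a minimal spark-shell style sketch. The Person case class, the sample rows, the view name and the app name are all hypothetical, chosen only to show a typed Dataset next to its tabular DataFrame view.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Reuses the running session inside spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("typed-vs-tabular").getOrCreate()
import spark.implicits._

// Hypothetical domain type; an Encoder for it is derived via spark.implicits.
case class Person(name: String, age: Int)

// Dataset: a container of strongly-typed objects.
val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

// Typed operation: the lambda sees Person objects and is checked at compile time.
val adultsTyped: Dataset[Person] = people.filter((p: Person) => p.age >= 21)

// DataFrame: the same data as named columns, queried with column expressions...
val df: DataFrame = people.toDF()
val adultsUntyped: DataFrame = df.filter($"age" >= 21).select("name")

// ...or with SQL against a temporary view.
df.createOrReplaceTempView("people")
val viaSql: DataFrame = spark.sql("SELECT name FROM people WHERE age >= 21")

// Going back to the typed view is an explicit step.
val typedAgain: Dataset[Person] = df.as[Person]
```

A misspelled field in the typed filter fails at compile time, whereas a misspelled column name in the DataFrame or SQL version only fails when the query is analyzed at run time.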
