Posted to user@spark.apache.org by brccosta <br...@gmail.com> on 2016/07/07 11:20:14 UTC

RDD and Dataframes

Dear guys,

I'm investigating the differences between RDDs and DataFrames/Datasets. I
couldn't find an answer to this question: do DataFrames act as a new layer
in the Spark stack? I mean, is there a conversion to RDDs during execution?

For example, if I create a DataFrame and run a query, will it be transformed
into an RDD in the final step so that Spark can execute it?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Dataframes-tp27306.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: RDD and Dataframes

Posted by "Taotao.Li" <ch...@gmail.com>.

Hi brccosta, Databricks has just posted a blog article about *RDDs,
DataFrames and Datasets*. You can read it here:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
I think it will be very helpful for you.

*___________________*
Quant | Engineer | Boy
*___________________*
*blog*:    http://litaotao.github.io
<http://litaotao.github.io/?utm_source=spark_mail>
*github*: www.github.com/litaotao


On Sat, Jul 16, 2016 at 7:53 AM, RK Aduri <rk...@collectivei.com> wrote:

> DataFrames use RDDs as the internal implementation of their structure. A
> DataFrame isn't converted to an RDD; rather, it uses RDD partitions to
> produce a logical plan.



Re: RDD and Dataframes

Posted by RK Aduri <rk...@collectivei.com>.

DataFrames use RDDs as the internal implementation of their structure. A
DataFrame isn't converted to an RDD; rather, it uses RDD partitions to
produce a logical plan.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Dataframes-tp27306p27346.html

Re: RDD and Dataframes

Posted by Bruno Costa <br...@gmail.com>.

Thank you for the answer.

One of the optimizations in DataFrames/Datasets (beyond Catalyst) is the
Encoders (Project Tungsten), which translate domain objects into Spark's
internal binary format. With encoders, the data is no longer managed by the
Java Virtual Machine (which increases memory usage through object metadata
and processing time through garbage collection). However, if the query is
converted to an RDD internally, is that RDD also not managed by the JVM?
Otherwise, the encoders wouldn't really provide any optimization...

2016-07-07 9:10 GMT-03:00 Rishi Mishra <rm...@snappydata.io>:

> Yes, in the end it is converted to an RDD internally. However, DataFrame
> queries pass through Catalyst, which provides several optimizations (e.g.
> code generation and intelligent shuffles) that are not available for pure
> RDDs.

-- 
 Bruno.
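
A minimal sketch related to the encoder question in this message, assuming
Spark 2.x with a local session. The Person class, the sample data, and the
object name are hypothetical, and InternalRow is an internal Spark class
used here only to make the types visible. It shows that the RDD a Dataset
query executes as holds rows in Spark's internal format, while the public
.rdd conversion decodes them back into domain objects through the encoder.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical domain class, defined at top level so an encoder can be derived.
case class Person(name: String, age: Int)

object EncodersAndRdds {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("encoders-and-rdds")
      .getOrCreate()
    import spark.implicits._ // provides the implicit Encoder[Person]

    val people: Dataset[Person] = Seq(Person("Ana", 31), Person("Bo", 17)).toDS()
    val adults: Dataset[Person] = people.filter($"age" >= 18)

    // The RDD the query executes as: its elements are internal rows,
    // not Person objects.
    val internal: RDD[InternalRow] = adults.queryExecution.toRdd

    // The public conversion: each internal row is decoded back into a
    // Person object by the encoder.
    val objects: RDD[Person] = adults.rdd

    println(internal.count())
    println(objects.collect().mkString(", "))

    spark.stop()
  }
}
```

Under these assumptions, domain objects are only materialized when the
Dataset is explicitly converted with .rdd or when results are collected, so
the encoders still matter even though the query runs as an RDD.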

Re: RDD and Dataframes

Posted by Rishi Mishra <rm...@snappydata.io>.

Yes, in the end it is converted to an RDD internally. However, DataFrame
queries pass through Catalyst, which provides several optimizations (e.g.
code generation and intelligent shuffles) that are not available for pure
RDDs.

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra
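
The behaviour described in this reply can be inspected directly. The
following is a minimal sketch, assuming Spark 2.x and a local session; the
sample data, column names, and object name are made up for illustration. It
prints the plans Catalyst produces for a DataFrame query and then the
lineage of the RDD that the physical plan executes as.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameToRdd {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-to-rdd")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
    val query = df.filter($"id" > 1).select("label")

    // Prints the parsed, analyzed, and optimized logical plans, followed by
    // the physical plan chosen by Catalyst.
    query.explain(true)

    // The physical plan is executed as an RDD of Spark's internal rows;
    // toDebugString prints that RDD's lineage.
    val internalRdd = query.queryExecution.toRdd
    println(internalRdd.toDebugString)

    spark.stop()
  }
}
```

The exact plan text and lineage depend on the Spark version and the data
source, so no output is shown here.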
