Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2016/02/16 17:05:44 UTC

Use case for RDD and Data Frame

Gurus,
What are the main differences between a Resilient Distributed Dataset (RDD) and a DataFrame (DF)?
Where can one use an RDD without transforming it to a DF?
Regards and obliged

Re: Use case for RDD and Data Frame

Posted by Andy Grove <an...@agildata.com>.
This blog post should be helpful:

http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Tue, Feb 16, 2016 at 9:05 AM, Ashok Kumar <as...@yahoo.com.invalid>
wrote:

> Gurus,
>
> What are the main differences between a Resilient Distributed Dataset (RDD)
> and a DataFrame (DF)?
>
> Where can one use an RDD without transforming it to a DF?
>
> Regards and obliged
>

Re: Use case for RDD and Data Frame

Posted by Chandeep Singh <cs...@chandeep.com>.
Ah. My bad! :)

> On Feb 16, 2016, at 6:24 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Thanks Chandeep.
>  
> Andy Grove, the author, pointed to that article earlier in the thread :)


RE: Use case for RDD and Data Frame

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Chandeep.

 

Andy Grove, the author, pointed to that article earlier in the thread :)

 

 

 

Dr Mich Talebzadeh

 

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 16 February 2016 18:17
To: Mich Talebzadeh <mi...@peridale.co.uk>
Cc: Ashok Kumar <as...@yahoo.com>; User <us...@spark.apache.org>
Subject: Re: Use case for RDD and Data Frame

 

Here is another interesting post.

 

http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

 



Re: Use case for RDD and Data Frame

Posted by Chandeep Singh <cs...@chandeep.com>.
Here is another interesting post.

http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer



RE: Use case for RDD and Data Frame

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi,

 

A Resilient Distributed Dataset (RDD) is a collection of data distributed among all the nodes of the cluster. It is basically raw data, and that is about all there is to it; Spark applies little optimization to it. Remember, data is not of much value until it is turned into information.
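
For instance (a minimal sketch, assuming the spark-shell, where sc is predefined, and made-up sample data), Spark sees an RDD only as a distributed collection of opaque objects; any structure is imposed by your own code:

val lines = sc.parallelize(Seq("a,1", "b,2"))      // RDD[String]: just raw strings to Spark
val total = lines.map(_.split(",")(1).toInt).sum() // structure comes from your own parsing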

 

On the other hand, a DataFrame is the equivalent of a table in an RDBMS, akin to a table in Oracle or Sybase. In other words, it is a two-dimensional, array-like structure in which each column contains measurements on one variable and each row contains one case.

 

So a DataFrame by definition carries additional metadata thanks to its tabular format, which allows the Spark optimizer, AKA Catalyst, to take advantage of that format for certain optimizations. Even after so many years, the relational model is arguably the most elegant model known, used and emulated everywhere.
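
For instance, a DataFrame can print that metadata directly (a minimal sketch, assuming the Spark 1.6-era spark-shell, where sqlContext and its implicits are already set up; the names and ages are made up):

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
people.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)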

 

Much like a table in an RDBMS, a DataFrame keeps track of its schema and supports various relational operations that lead to more optimized execution. Essentially, each DataFrame object represents a logical plan, but because of its "lazy" nature no execution occurs until the user calls a specific "output operation" (an action). This is very important to remember. You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method, as sketched below.
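
A minimal round trip, again assuming the spark-shell (sc predefined, implicits imported) and made-up tuples:

val pairs = sc.parallelize(Seq(("alice", 1), ("bob", 2))).toDF("name", "id") // RDD -> DataFrame via toDF
val asRdd = pairs.rdd // DataFrame -> RDD[org.apache.spark.sql.Row]
asRdd.map(row => row.getString(0)).collect().foreach(println)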

 

In general it is recommended to use a DataFrame where possible because of the built-in query optimization.
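
You can see the optimizer at work by asking a DataFrame for its plan; continuing the sketch above (still in the spark-shell):

// explain(true) prints the logical plan, the plan after Catalyst's
// optimizations, and the physical plan that will actually run
pairs.filter(pairs("id") > 1).explain(true)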

 

For those familiar with SQL, a DataFrame can be conveniently registered as a temporary table, and SQL operations can then be performed on it.

 

Case in point: I am searching all my replication server log files, compressed and stored in an HDFS directory, for errors on a specific connection.

 

// assumes the spark-shell, where sc, sqlContext, and the imports
// sqlContext.implicits._ and sqlContext.sql are already set up
// create an RDD from the compressed log file
val rdd = sc.textFile("/test/REP_LOG.gz")
// convert it to a DataFrame with a single column named "line"
val df = rdd.toDF("line")
// register the DataFrame as a temporary table
df.registerTempTable("t")
println("\n Search for ERROR plus another word in table t\n")
sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)

 

Alternatively, you can use method calls on the DataFrame itself to filter for the word:

 

// col lives in org.apache.spark.sql.functions and is not imported by default in the shell
import org.apache.spark.sql.functions.col
df.filter(col("line").like("%ERROR%")).collect().foreach(println)
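
The two LIKE predicates from the SQL above can be combined the same way; continuing the snippet:

df.filter(col("line").like("%ERROR%") && col("line").like("%hiveserver2.asehadoop%")).collect().foreach(println)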

 

HTH,

 

Dr Mich Talebzadeh

 

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

 

 

From: Ashok Kumar [mailto:ashok34668@yahoo.com.INVALID] 
Sent: 16 February 2016 16:06
To: User <us...@spark.apache.org>
Subject: Use case for RDD and Data Frame

 

Gurus,

 

What are the main differences between a Resilient Distributed Dataset (RDD) and a DataFrame (DF)?
 

Where can one use an RDD without transforming it to a DF?

 

Regards and obliged