Posted to user@spark.apache.org by Cesar Flores <ce...@gmail.com> on 2015/03/10 22:13:31 UTC

SchemaRDD: SQL Queries vs Language Integrated Queries

I am new to the SchemaRDD class, and I am trying to decide between using
SQL queries and Language Integrated Queries (
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
).

Can someone tell me what is the main difference between the two approaches,
besides using different syntax? Are they interchangeable? Which one has
better performance?


Thanks a lot
-- 
Cesar Flores

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores <ce...@gmail.com> wrote:
>
> Thanks for both answers. One final question. Isn't registerTempTable an
> extra step that the SQL queries need, and one that may decrease
> performance compared with the language integrated method calls?
>

As far as I know, registerTempTable is just an insertion into a
Map[String, SchemaRDD], so its cost should not be measurable. I don't think
any distributed/RDD operations are involved.

Tobias

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

Posted by Cesar Flores <ce...@gmail.com>.
Hi:

Thanks for both answers. One final question. Isn't registerTempTable an
extra step that the SQL queries need, and one that may decrease
performance compared with the language integrated method calls? The thing
is that I am planning to use them in the current version of the ML Pipeline
transformer classes for feature extraction, and if I need to save the
input and maybe the output SchemaRDD of the transform function in every
transformer, this may not be very efficient.


Thanks

-- 
Cesar Flores

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores <ce...@gmail.com> wrote:

> I am new to the SchemaRDD class, and I am trying to decide between using
> SQL queries and Language Integrated Queries (
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> ).
>
> Can someone tell me what is the main difference between the two
> approaches, besides using different syntax? Are they interchangeable? Which
> one has better performance?
>

One difference is that the language integrated queries are method calls on
the SchemaRDD you want to work on, which requires that you have access to
that object. The SQL queries are passed to a method of the SQLContext,
and you have to call registerTempTable() on the SchemaRDD you want to use
beforehand, which can happen basically anywhere in your program. (I don't
know if I managed to express what I wanted to say.) That may influence how
you design your program and how the different parts work together.
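
To make the difference concrete, here is a minimal sketch of both styles,
assuming the Spark 1.2 Scala API. The Person case class, the sample rows,
the table name "people" and the local[2] master are made up for this
example and do not come from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this example.
case class Person(name: String, age: Int)

object SqlVsLanguageIntegrated {
  // Returns the names found by the SQL query and by the
  // language integrated query, in that order.
  def teenagerNames(sc: SparkContext): (Set[String], Set[String]) = {
    val sqlContext = new SQLContext(sc)
    // Brings in the implicit RDD-to-SchemaRDD conversion and the
    // symbol-based expression syntax ('age >= 13).
    import sqlContext._

    val people = sc.parallelize(Seq(Person("Ann", 15), Person("Bob", 32)))

    // SQL style: register the data under a name once, then query it by
    // that name through the SQLContext. The code issuing the query only
    // needs the context, not a reference to `people`.
    people.registerTempTable("people")
    val viaSql =
      sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    // Language integrated style: method calls on the object itself, so
    // the caller must hold a reference to `people`.
    val viaDsl = people.where('age >= 13).where('age <= 19).select('name)

    (viaSql.collect().map(_.getString(0)).toSet,
     viaDsl.collect().map(_.getString(0)).toSet)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("sql-vs-dsl").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (fromSql, fromDsl) = teenagerNames(sc)
      println(s"SQL query:                 $fromSql")
      println(s"Language integrated query: $fromDsl")
    } finally {
      sc.stop()
    }
  }
}
```

In the SQL variant, the string "people" is the only link between the place
where the data is registered and the place where it is queried, which is
the design difference described above.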

Tobias

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

Posted by Reynold Xin <rx...@databricks.com>.
They should have the same performance, as they are compiled down to the
same execution plan.
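
If you want to check this on your own queries, one option is to print the
physical plan of each variant and compare them by eye. The sketch below
assumes the Spark 1.2 Scala API, including the queryExecution developer API
on SchemaRDD. The Event case class, the sample rows and the table name
"events" are hypothetical and only serve the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this example.
case class Event(user: String, clicks: Int)

object ComparePlans {
  // Returns the physical plans of an SQL query and of the equivalent
  // language integrated query, rendered as strings.
  def plans(sc: SparkContext): (String, String) = {
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    val events = sc.parallelize(Seq(Event("a", 3), Event("b", 12)))
    events.registerTempTable("events")

    val viaSql = sqlContext.sql("SELECT user FROM events WHERE clicks > 10")
    val viaDsl = events.where('clicks > 10).select('user)

    // queryExecution is a developer API; executedPlan is the physical
    // plan that Spark would run for the query.
    (viaSql.queryExecution.executedPlan.toString,
     viaDsl.queryExecution.executedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("compare-plans").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (sqlPlan, dslPlan) = plans(sc)
      println("Plan for the SQL query:")
      println(sqlPlan)
      println("Plan for the language integrated query:")
      println(dslPlan)
    } finally {
      sc.stop()
    }
  }
}
```

The printed plans may differ in internal attribute ids even when the
operators are the same, so compare the operator structure rather than the
exact text.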

Note that starting in Spark 1.3, SchemaRDD has been renamed to DataFrame:

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html


