Posted to user@spark.apache.org by All In A Days Work <al...@gmail.com> on 2014/04/03 06:52:40 UTC

Example of creating expressions for SchemaRDD methods

For various SchemaRDD functions like select, where, orderBy, groupBy, etc., I
would like to create expression objects and pass these to the methods for
execution.

Can someone show some examples of how to create expressions for a case class
and execute them? E.g., how to create expressions for select, order by, group
by, etc., and execute the methods using the expressions?

Regards,

Re: Example of creating expressions for SchemaRDD methods

Posted by Michael Armbrust <mi...@databricks.com>.
Minor typo in the example.  The first SELECT statement should actually be:

sql("SELECT * FROM src")

Where `src` is a Hive table with the schema (key INT, value STRING).


On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust <mi...@databricks.com> wrote:

>
>> In such a construct, each operator builds on the previous one, including any
>> materialized results, etc. If I use a SQL statement for each of them, I
>> suspect the later SQL statements will not leverage the earlier ones in any
>> way - hence these will be inefficient compared to the first approach. Let me
>> know if this is not correct.
>>
>
> This is not correct.  When you run a SQL statement and register it as a
> table, it is the logical plan for this query that is used when this virtual
> table is referenced in later queries, not the results.  SQL queries are
> lazy, just like RDDs and DSL queries.  This is illustrated below.
>
>
> scala> sql("SELECT * FROM selectQuery")
> res3: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[12] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None
>
> scala> sql("SELECT * FROM src").registerAsTable("selectQuery")
>
> scala> sql("SELECT key FROM selectQuery")
> res5: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[24] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
>
> Even though the second query is running over the "results" of the first
> query (which requested all columns using *), the optimizer is still able to
> come up with an efficient plan that avoids reading "value" from the table,
> as can be seen by the arguments of the HiveTableScan.
>
> Note that if you call sqlContext.cacheTable("selectQuery") then you are
> correct.  The results will be materialized in an in-memory columnar format,
> and subsequent queries will be run over these materialized results.
>
>
>> The reason for building expressions is that the use case needs these to
>> be created on the fly based on some case class at runtime.
>>
>> I.e., I can't type these in the REPL. The Scala code will define some case
>> class A (a: ..., b: ..., c: ...) where the class name, member names, and
>> types will be known beforehand, and the RDD will be defined on this. Then,
>> based on user action, the above pipeline needs to be constructed on the fly.
>> Thus the expressions have to be constructed on the fly from class members and
>> other predicates, etc., most probably using expression constructors.
>>
>> Could you please share how expressions can be constructed using the
>> expression APIs (and not in the REPL)?
>>
>
> I'm not sure I completely understand the use case here, but you should be
> able to construct symbols and use the DSL to create expressions at runtime,
> just like in the REPL.
>
> val attrName: String = "name"
> val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
>
> There is currently no public API for constructing expressions manually
> other than SQL or the DSL.  While you could dig into
> org.apache.spark.sql.catalyst.expressions._, these APIs are considered
> internal, and *will not be stable between versions*.
>
> Michael
>
>
>
>

Re: Example of creating expressions for SchemaRDD methods

Posted by Michael Armbrust <mi...@databricks.com>.
> In such a construct, each operator builds on the previous one, including any
> materialized results, etc. If I use a SQL statement for each of them, I
> suspect the later SQL statements will not leverage the earlier ones in any
> way - hence these will be inefficient compared to the first approach. Let me
> know if this is not correct.
>

This is not correct.  When you run a SQL statement and register it as a
table, it is the logical plan for this query that is used when this virtual
table is referenced in later queries, not the results.  SQL queries are
lazy, just like RDDs and DSL queries.  This is illustrated below.


scala> sql("SELECT * FROM selectQuery")
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None

Even though the second query is running over the "results" of the first
query (which requested all columns using *), the optimizer is still able to
come up with an efficient plan that avoids reading "value" from the table,
as can be seen by the arguments of the HiveTableScan.

Note that if you call sqlContext.cacheTable("selectQuery") then you are
correct.  The results will be materialized in an in-memory columnar format,
and subsequent queries will be run over these materialized results.
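
For example, a minimal sketch of that behaviour, continuing the session above
(output elided; sqlContext is the same context whose sql method appears in the
transcript):

sqlContext.cacheTable("selectQuery")           // cache the table in the in-memory columnar format
sql("SELECT key FROM selectQuery").collect()   // subsequent queries scan the cached columns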


> The reason for building expressions is that the use case needs these to be
> created on the fly based on some case class at runtime.
>
> I.e., I can't type these in the REPL. The Scala code will define some case
> class A (a: ..., b: ..., c: ...) where the class name, member names, and
> types will be known beforehand, and the RDD will be defined on this. Then,
> based on user action, the above pipeline needs to be constructed on the fly.
> Thus the expressions have to be constructed on the fly from class members and
> other predicates, etc., most probably using expression constructors.
>
> Could you please share how expressions can be constructed using the
> expression APIs (and not in the REPL)?
>

I'm not sure I completely understand the use case here, but you should be
able to construct symbols and use the DSL to create expressions at runtime,
just like in the REPL.

val attrName: String = "name"
val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
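
As a rough, hypothetical sketch of feeding such runtime-built expressions into
a query (the case class, context, and column names below are made up for
illustration, and the DSL method signatures may differ between versions):

// Assumes an existing SparkContext named sc.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // implicit conversions, including Symbol -> attribute

case class Person(name: String, age: Int)
val people = createSchemaRDD(sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))))

// Column names arrive as plain strings at runtime.
val filterCol = "age"
val projectCol = "name"

// Build expressions from the strings and pass them to the SchemaRDD methods.
val result = people.where(Symbol(filterCol) === 30).select(Symbol(projectCol))
result.collect().foreach(println)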

There is currently no public API for constructing expressions manually
other than SQL or the DSL.  While you could dig into
org.apache.spark.sql.catalyst.expressions._, these APIs are considered
internal, and *will not be stable between versions*.

Michael

Re: Example of creating expressions for SchemaRDD methods

Posted by All In A Days Work <al...@gmail.com>.
Hi Michael,

The idea is to build a pipeline of operators on RDD, leveraging existing
operations already done.

E.g.

rdd1 = rdd.select(...)
rdd2 = rdd1.where(...)
rdd3 = rdd2.groupBy(...) etc.

In such a construct, each operator builds on the previous one, including any
materialized results, etc. If I use a SQL statement for each of them, I suspect
the later SQL statements will not leverage the earlier ones in any way - hence
these will be inefficient compared to the first approach. Let me know if this
is not correct.

The reason for building expressions is that the use case needs these to be
created on the fly based on some case class at runtime.

I.e., I can't type these in the REPL. The Scala code will define some case
class A (a: ..., b: ..., c: ...) where the class name, member names, and types
will be known beforehand, and the RDD will be defined on this. Then, based on
user action, the above pipeline needs to be constructed on the fly. Thus the
expressions have to be constructed on the fly from class members and other
predicates, etc., most probably using expression constructors.

Could you please share how expressions can be constructed using the expression
APIs (and not in the REPL)?

Regards,




On Thu, Apr 3, 2014 at 11:35 AM, Michael Armbrust <mi...@databricks.com> wrote:

> I'll start off by saying that the DSL is pretty experimental, and we are
> still figuring out exactly how to expose all of it to end users.  Right now
> you are going to get more fully featured functionality from SQL.  Under the
> covers, using these two representations results in the same execution plans.
>
> That said, you can create expressions using implicit conversions that are
> provided when you import a SQLContext.
>
> The leaves of these expression trees will be either `Attributes` (columns
> of the tables) or `Literals` (constant values).  You can represent an
> attribute by creating a Scala Symbol <http://daily-scala.blogspot.com/2010/01/symbols.html> (an
> identifier that is prefixed with a ').  When you perform an operation on a
> Symbol, it gets converted into an attribute, and an expression is returned.
> scala> 'a + 1
> res7: org.apache.spark.sql.catalyst.expressions.Add = ('a + 1)
>
> You can also construct more complex expressions:
> scala> 'b + 1 === 3 && 'c === 2
> res9: org.apache.spark.sql.catalyst.expressions.And = ((('b + 1) = 3) &&
> ('c = 2))
>
> Note that sometimes we have to use special operators to avoid falling back
> on the standard JVM implementation.  For example, since symbols already have
> a == method, doing the following does not give us what we want.
> scala> 'a == 'b
> res13: Boolean = false
>
> In contrast, since === is not defined on a symbol, the compiler is forced
> to use an implicit conversion from the SQLContext.  This gives us the
> desired result.
> scala> 'a === 'b
> res14: org.apache.spark.sql.catalyst.expressions.Equals = ('a = 'b)
>
> Let me know if you have further questions!
>
> Michael
>
>
> On Wed, Apr 2, 2014 at 9:52 PM, All In A Days Work <
> allinadayswrk@gmail.com> wrote:
>
>> For various SchemaRDD functions like select, where, orderBy, groupBy, etc.,
>> I would like to create expression objects and pass these to the methods for
>> execution.
>>
>> Can someone show some examples of how to create expressions for a case
>> class and execute them? E.g., how to create expressions for select, order
>> by, group by, etc., and execute the methods using the expressions?
>>
>> Regards,
>>
>
>

Re: Example of creating expressions for SchemaRDD methods

Posted by Michael Armbrust <mi...@databricks.com>.
I'll start off by saying that the DSL is pretty experimental, and we are
still figuring out exactly how to expose all of it to end users.  Right now
you are going to get more fully featured functionality from SQL.  Under the
covers, using these two representations results in the same execution plans.

That said, you can create expressions using implicit conversions that are
provided when you import a SQLContext.
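
For example (a sketch of the setup, assuming an existing SparkContext named sc;
a HiveContext works the same way since it extends SQLContext):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // brings the implicit conversions used by the DSL into scope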

The leaves of these expression trees will be either `Attributes` (columns
of the tables) or `Literals` (constant values).  You can represent an
attribute by creating a Scala
Symbol <http://daily-scala.blogspot.com/2010/01/symbols.html> (an
identifier that is prefixed with a ').  When you perform an operation on a
Symbol, it gets converted into an attribute, and an expression is returned.
scala> 'a + 1
res7: org.apache.spark.sql.catalyst.expressions.Add = ('a + 1)

You can also construct more complex expressions:
scala> 'b + 1 === 3 && 'c === 2
res9: org.apache.spark.sql.catalyst.expressions.And = ((('b + 1) = 3) &&
('c = 2))

Note that sometimes we have to use special operators to avoid falling back
on the standard JVM implementation.  For example, since symbols already have
a == method, doing the following does not give us what we want.
scala> 'a == 'b
res13: Boolean = false

In contrast, since === is not defined on a symbol, the compiler is forced
to use an implicit conversion from the SQLContext.  This gives us the
desired result.
scala> 'a === 'b
res14: org.apache.spark.sql.catalyst.expressions.Equals = ('a = 'b)
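
Putting this together, a rough sketch of passing such expressions to the
SchemaRDD methods from your question (here `people` is assumed to be a
SchemaRDD with columns `name` and `age`; the exact method signatures may vary
between versions):

people.where('age === 19).select('name).collect()          // filter, project, and run the job
people.where('age >= 13 && 'age <= 19).orderBy('name.asc)  // combined predicate, sorted by name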

Let me know if you have further questions!

Michael


On Wed, Apr 2, 2014 at 9:52 PM, All In A Days Work
<al...@gmail.com> wrote:

> For various SchemaRDD functions like select, where, orderBy, groupBy, etc., I
> would like to create expression objects and pass these to the methods for
> execution.
>
> Can someone show some examples of how to create expressions for a case class
> and execute them? E.g., how to create expressions for select, order by, group
> by, etc., and execute the methods using the expressions?
>
> Regards,
>