Posted to dev@spark.apache.org by Yana Kadiyska <ya...@gmail.com> on 2015/11/05 23:34:01 UTC

Need advice on hooking into SQL query plan

Hi folks, not sure if this belongs on the dev or the user list... sending
to dev as it seems a bit convoluted.

I have a UI in which we allow users to write ad-hoc queries against a (very
large, partitioned) table. I would like to analyze the queries prior to
execution for two purposes:

1. Reject under-constrained queries (i.e., queries missing a field
predicate that I want to make sure is always present)
2. Augment the query with additional predicates (e.g., if the user asks for
a student_id, I also want to push a constraint on another field)

I could parse the SQL string myself before passing it to Spark, but Spark
already does this anyway. Can someone give me a general direction on how to
do this (if it's possible)?

Something like:

val myDF = sqlContext.sql(userSqlQuery)  // userSqlQuery: the user's ad-hoc query
myDF.queryExecution.logical  // examine the filters provided by the user here;
                             // reject if under-constrained, push new filters
                             // as needed (via withNewChildren?)

At this point, with some luck, I'd have a new LogicalPlan -- what is the
proper way to create an execution plan on top of this new plan? I'm looking
at
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
but that method is restricted to the package. I'd really prefer to hook in
as early as possible and still let Spark run the plan optimizations as
usual.
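
For concreteness, here is roughly the check I'm picturing for purpose (1).
This is only a sketch against the 1.5-era Catalyst internals -- I match on
the analyzed plan so that attribute references are resolved, and
"teacherName" stands in for my real partition column:

import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// true if some Filter in the plan pins `column` to a literal value
def constrainsColumn(plan: LogicalPlan, column: String): Boolean =
  plan.collect { case f: Filter => f.condition }.exists { cond =>
    cond.collect {
      case EqualTo(a: AttributeReference, _: Literal) if a.name == column => a
      case EqualTo(_: Literal, a: AttributeReference) if a.name == column => a
    }.nonEmpty
  }

val myDF = sqlContext.sql(userSqlQuery)
if (!constrainsColumn(myDF.queryExecution.analyzed, "teacherName"))
  sys.error("rejected: query must constrain teacherName")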

Any guidance or pointers much appreciated.

Re: Need advice on hooking into SQL query plan

Posted by Reynold Xin <rx...@databricks.com>.
You can hack around this by constructing logical plans yourself and then
creating a DataFrame to execute them. Note that this all depends on
internals of the framework and can break when Spark upgrades.
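
A minimal sketch of that hack, assuming the 1.5 code base (where the
DataFrame constructor taking a LogicalPlan is private[sql]; the helper name
here is made up): compile a small object into Spark's own package so the
constructor is visible, then analysis and optimization run as usual.

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// lives in org.apache.spark.sql so the private[sql] constructor is in scope
object LogicalPlanExecutor {
  def toDF(sqlContext: SQLContext, plan: LogicalPlan): DataFrame =
    new DataFrame(sqlContext, plan)
}

// usage from your own code:
//   val df = LogicalPlanExecutor.toDF(sqlContext, augmentedPlan)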



Re: Need advice on hooking into SQL query plan

Posted by Yana Kadiyska <ya...@gmail.com>.
I don't think a view would help -- in the case of under-constraining, I
want to make sure that the user is constraining a particular column (e.g.,
I want to restrict them to querying a single partition at a time, but I
don't care which one). A view per partition value is not practical due to
the fairly high cardinality.

In the case of predicate augmentation, the additional predicate depends on
the value the user is providing: e.g., my data is partitioned on
teacherName, but the end users don't have this information. So if they ask
for student_id="1234", I'd like to add "teacherName='Smith'" based on a
mapping that is not surfaced to the user (sorry for the contrived example).
I don't think I can do this with a view. A join would produce the right
answer, but it is counter-productive, as my goal is to minimize the number
of partitions being processed.
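
Concretely, the rewrite I'd want amounts to something like the following --
just a sketch against the 1.5 Catalyst internals, where studentToTeacher is
a hypothetical lookup I'd maintain on the side, and I'm assuming string
columns with no casts in between:

import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.Filter

val studentToTeacher: Map[String, String] = Map("1234" -> "Smith")

val myDF = sqlContext.sql(userSqlQuery)
val augmented = myDF.queryExecution.analyzed transform {
  // find a filter pinning student_id to a known value and AND in the
  // corresponding teacherName predicate
  case f @ Filter(cond @ EqualTo(a: AttributeReference, Literal(id, _)), _)
      if a.name == "student_id" && studentToTeacher.contains(id.toString) =>
    val teacher = f.child.output.find(_.name == "teacherName")
      .getOrElse(sys.error("teacherName not in scope"))
    f.copy(condition =
      And(cond, EqualTo(teacher, Literal(studentToTeacher(id.toString)))))
}
// turning `augmented` back into something executable is exactly the part
// I'm asking about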

I can parse the query myself -- I was not fond of this solution, as I'd go
from SQL string to parse tree and back to an augmented SQL string, only to
have Spark repeat the first part of the exercise... but I will if need be.
And yes, I'd have to be able to process sub-queries too.


Re: Need advice on hooking into SQL query plan

Posted by Jörn Franke <jo...@gmail.com>.
Would it be possible to use views to address some of your requirements?
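
For example (schematic only -- table and view names are made up), a view
here would just be a pre-constrained table exposed to the UI in place of
the raw one:

// register a constrained "view" over the raw table and point the UI at it
val constrained = sqlContext.table("grades").filter("teacherName = 'Smith'")
constrained.registerTempTable("grades_smith")
// ad-hoc user queries then run against grades_smith, never the raw table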

Alternatively, it might be better to parse it yourself. There are open
source libraries for this, if you really need a complete SQL parser. Do you
want to do it on sub-queries as well?
