Posted to dev@calcite.apache.org by Vineet Garg <vg...@hortonworks.com> on 2016/09/22 00:11:51 UTC

Subquery de-correlation

Hello Julian/Calcite community,

I am working on adding subquery support to Hive using Calcite. From what I have read and understood so far, Calcite requires Hive to create a RexSubQuery node corresponding to a subquery and then call SubQueryRemoveRule to get rid of the RexSubQuery and turn it into a join. This seems to work for uncorrelated queries, where SubQueryRemoveRule creates an Aggregate + Join to replace the RexSubQuery. But I am running into the following issues with correlated queries (note that I am using the FILTER rule):

  *   Looking at the SubQueryRemoveRule code, it should create a Correlate node if it finds any correlation in the given filter. To find out whether the filter has a correlation, getVariablesSet is called on the filter, which supposedly should return the set of correlated variables, but it always returns an empty set because Filter does not implement this method. Shouldn’t Filter implement this method to return the appropriate correlated variables?
  *   Comments in SubQueryRemoveRule mention that “The correlate can be removed using RelDecorrelator”. But I don’t see SubQueryRemoveRule using RelDecorrelator to decorrelate the query. Should SubQueryRemoveRule call it? If not, is it appropriate to decorrelate immediately after SubQueryRemoveRule?

Here is what I have done so far for correlated queries. Could you please comment on whether this is right?

  *   While creating the RexSubQuery and the RelNode for the subquery, I create a RexCorrelVariable. RexCorrelVariable needs a correlation id, and CorrelationId requires an integer id. Should this id be the same as the index of the correlated column in the outer table?
  *   Hive has a HiveFilter, which extends Filter. I implemented the getVariablesSet method to look at the condition and return all correlated variables in the condition’s RelNode. Does this sound correct?
  *   I am calling RelDecorrelator’s decorrelateQuery immediately after calling SubQueryRemoveRule. After implementing getVariablesSet in HiveFilter, SubQueryRemoveRule seems to create the appropriate LogicalCorrelate for correlated queries, but decorrelateQuery throws an exception.

Thanks,
Vineet G

Re: Subquery de-correlation

Posted by Vineet Garg <vg...@hortonworks.com>.
Never mind, I figured it out by looking at the Calcite tests :)




On 9/22/16, 9:26 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:

>Hi Julian,
>
>Thank you for your response. I have a few follow-up questions:
>
>> Yes. Remember it should return only the correlating variables it sets, not those it inherits.
>What do you mean by “inherits”? Could you kindly provide an example to elaborate?
>
>> No, it shouldn’t necessarily. The id must be unique within the whole query.
>If the id is unique, how is a correlated variable in the inner query bound to the outer query? That is, how would Calcite figure out which variable in the outer query a particular correlated variable refers to?
>
>Vineet

Re: Subquery de-correlation

Posted by Vineet Garg <vg...@hortonworks.com>.
Hi Julian,

Thank you for your response. I have a few follow-up questions:

> Yes. Remember it should return only the correlating variables it sets, not those it inherits.
What do you mean by “inherits”? Could you kindly provide an example to elaborate?

> No, it shouldn’t necessarily. The id must be unique within the whole query.
If the id is unique, how is a correlated variable in the inner query bound to the outer query? That is, how would Calcite figure out which variable in the outer query a particular correlated variable refers to?

Vineet

From: Julian Hyde <jh...@apache.org>
Date: Thursday, September 22, 2016 at 3:05 PM
To: default <vg...@hortonworks.com>
Cc: "dev@calcite.apache.org" <de...@calcite.apache.org>
Subject: Re: Subquery de-correlation

Vineet,

Thanks for your message. See my responses inline.

On Sep 21, 2016, at 5:11 PM, Vineet Garg <vg...@hortonworks.com> wrote:

> Hello Julian/Calcite community,
>
> I am working on adding subquery support to Hive using Calcite. From what I have read and understood so far, Calcite requires Hive to create a RexSubQuery node corresponding to a subquery and then call SubQueryRemoveRule to get rid of the RexSubQuery and turn it into a join. This seems to work for uncorrelated queries, where SubQueryRemoveRule creates an Aggregate + Join to replace the RexSubQuery. But I am running into the following issues with correlated queries (note that I am using the FILTER rule):
>
>   *   Looking at the SubQueryRemoveRule code, it should create a Correlate node if it finds any correlation in the given filter. To find out whether the filter has a correlation, getVariablesSet is called on the filter, which supposedly should return the set of correlated variables, but it always returns an empty set because Filter does not implement this method. Shouldn’t Filter implement this method to return the appropriate correlated variables?

Yes. Remember it should return only the correlating variables it sets, not those it inherits.

>   *   Comments in SubQueryRemoveRule mention that “The correlate can be removed using RelDecorrelator”. But I don’t see SubQueryRemoveRule using RelDecorrelator to decorrelate the query. Should SubQueryRemoveRule call it? If not, is it appropriate to decorrelate immediately after SubQueryRemoveRule?

I would tend to invoke RelDecorrelator on the whole tree. But I see no reason in principle why it can’t be called on a section of the tree, as long as that section is self-contained (i.e. no unbound correlating variables).

> Here is what I have done so far for correlated queries. Could you please comment on whether this is right?
>
>   *   While creating the RexSubQuery and the RelNode for the subquery, I create a RexCorrelVariable. RexCorrelVariable needs a correlation id, and CorrelationId requires an integer id. Should this id be the same as the index of the correlated column in the outer table?

No, it shouldn’t necessarily. The id must be unique within the whole query.

>   *   Hive has a HiveFilter, which extends Filter. I implemented the getVariablesSet method to look at the condition and return all correlated variables in the condition’s RelNode. Does this sound correct?

Yes, sounds right.

>   *   I am calling RelDecorrelator’s decorrelateQuery immediately after calling SubQueryRemoveRule. After implementing getVariablesSet in HiveFilter, SubQueryRemoveRule seems to create the appropriate LogicalCorrelate for correlated queries, but decorrelateQuery throws an exception.

I can’t help too much if you are getting errors in Hive-land. This stuff is so complicated that I strongly suggest unit tests. Don’t do anything “new” in Hive; make sure that it all works on Calcite logical nodes. Write tests in RelOptRulesTest.

Julian
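
The follow-up question about how a unique id gets bound to the outer query is not answered in the thread itself, so the following is only a toy sketch of one way to think about it. It assumes that a correlating variable stands for the current row of the outer input and that the column is picked by a field access on top of it (which is how plans that print references such as $cor0.DEPTNO appear to work). ToyCorrelationScope, newId, bind and fieldAccess are hypothetical names invented for illustration; none of them are Calcite APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of correlating-variable binding. Hypothetical; not Calcite code. */
final class ToyCorrelationScope {
  private int nextId = 0;
  // Current outer row for each correlation id that has been bound.
  private final Map<String, Map<String, Object>> currentRows = new HashMap<>();

  /** Returns an id that is unique within this scope (one scope per query). */
  String newId() {
    return "$cor" + nextId++;
  }

  /** Called on behalf of the node that sets the variable, once per outer row. */
  void bind(String id, Map<String, Object> outerRow) {
    currentRows.put(id, outerRow);
  }

  /** Resolves an inner-query reference: the row is found by id, the column by name. */
  Object fieldAccess(String id, String field) {
    Map<String, Object> row = currentRows.get(id);
    if (row == null) {
      throw new IllegalStateException("unbound correlating variable " + id);
    }
    return row.get(field);
  }

  public static void main(String[] args) {
    ToyCorrelationScope scope = new ToyCorrelationScope();
    String emp = scope.newId();

    Map<String, Object> empRow = new HashMap<>();
    empRow.put("DEPTNO", 10);
    scope.bind(emp, empRow);

    System.out.println(emp + ".DEPTNO = " + scope.fieldAccess(emp, "DEPTNO"));
  }
}
```

In this model the integer in the id is only a key that ties a reference to the node that declared it. The column is chosen separately by the field access, which is why the id does not need to match a column index.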


Re: Subquery de-correlation

Posted by Julian Hyde <jh...@apache.org>.
Vineet,

Thanks for your message. See my responses inline.

> On Sep 21, 2016, at 5:11 PM, Vineet Garg <vg...@hortonworks.com> wrote:
> 
> Hello Julian/Calcite community,
> 
> I am working on adding subquery support to Hive using Calcite. From what I have read and understood so far, Calcite requires Hive to create a RexSubQuery node corresponding to a subquery and then call SubQueryRemoveRule to get rid of the RexSubQuery and turn it into a join. This seems to work for uncorrelated queries, where SubQueryRemoveRule creates an Aggregate + Join to replace the RexSubQuery. But I am running into the following issues with correlated queries (note that I am using the FILTER rule):
>
>   *   Looking at the SubQueryRemoveRule code, it should create a Correlate node if it finds any correlation in the given filter. To find out whether the filter has a correlation, getVariablesSet is called on the filter, which supposedly should return the set of correlated variables, but it always returns an empty set because Filter does not implement this method. Shouldn’t Filter implement this method to return the appropriate correlated variables?

Yes. Remember it should return only the correlating variables it sets, not those it inherits.

>   *   Comments in SubQueryRemoveRule mention that “The correlate can be removed using RelDecorrelator”. But I don’t see SubQueryRemoveRule using RelDecorrelator to decorrelate the query. Should SubQueryRemoveRule call it? If not, is it appropriate to decorrelate immediately after SubQueryRemoveRule?

I would tend to invoke RelDecorrelator on the whole tree. But I see no reason in principle why it can’t be called on a section of the tree, as long as that section is self-contained (i.e. no unbound correlating variables).

> Here is what I have done so far for correlated queries. Could you please comment on whether this is right?
>
>   *   While creating the RexSubQuery and the RelNode for the subquery, I create a RexCorrelVariable. RexCorrelVariable needs a correlation id, and CorrelationId requires an integer id. Should this id be the same as the index of the correlated column in the outer table?

No, it shouldn’t necessarily. The id must be unique within the whole query.

>   *   Hive has a HiveFilter, which extends Filter. I implemented the getVariablesSet method to look at the condition and return all correlated variables in the condition’s RelNode. Does this sound correct?

Yes, sounds right.

>   *   I am calling RelDecorrelator’s decorrelateQuery immediately after calling SubQueryRemoveRule. After implementing getVariablesSet in HiveFilter, SubQueryRemoveRule seems to create the appropriate LogicalCorrelate for correlated queries, but decorrelateQuery throws an exception.

I can’t help too much if you are getting errors in Hive-land. This stuff is so complicated that I strongly suggest unit tests. Don’t do anything “new” in Hive; make sure that it all works on Calcite logical nodes. Write tests in RelOptRulesTest.

Julian
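
The two points in the reply above ("only the correlating variables it sets, not those it inherits" and "self-contained, i.e. no unbound correlating variables") can be illustrated with a small toy model. This is a sketch under assumptions, not Calcite code: ToyNode, variablesUsed, unbound and ids are hypothetical names, and a nested subquery is modelled simply as a child node. A node's variablesSet holds only the ids it defines for the nodes below it; ids that arrive from an enclosing node are inherited and are not listed there.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for a relational node. Hypothetical; not a Calcite class. */
final class ToyNode {
  final String name;
  final Set<String> variablesSet;   // ids this node defines for the nodes below it
  final Set<String> variablesUsed;  // ids referenced directly by this node's own expressions
  final List<ToyNode> children;

  ToyNode(String name, Set<String> variablesSet, Set<String> variablesUsed,
      ToyNode... children) {
    this.name = name;
    this.variablesSet = variablesSet;
    this.variablesUsed = variablesUsed;
    this.children = Arrays.asList(children);
  }

  /** Ids referenced in this subtree that no node inside the subtree sets. */
  Set<String> unbound() {
    Set<String> fromChildren = new TreeSet<>();
    for (ToyNode child : children) {
      fromChildren.addAll(child.unbound());
    }
    // Ids set here are bound for everything below this node.
    fromChildren.removeAll(variablesSet);
    // Ids used directly by this node must come from an enclosing node.
    Set<String> result = new TreeSet<>(variablesUsed);
    result.addAll(fromChildren);
    return result;
  }

  static Set<String> ids(String... ids) {
    return new TreeSet<>(Arrays.asList(ids));
  }

  /** Three nested filters: outer sets cor0, middle sets cor1, inner uses both. */
  static ToyNode build() {
    ToyNode inner = new ToyNode("innerFilter", ids(), ids("cor0", "cor1"));
    ToyNode middle = new ToyNode("middleFilter", ids("cor1"), ids("cor0"), inner);
    return new ToyNode("outerFilter", ids("cor0"), ids(), middle);
  }

  public static void main(String[] args) {
    ToyNode outer = build();
    ToyNode middle = outer.children.get(0);
    System.out.println(middle.name + " sets " + middle.variablesSet
        + ", unbound " + middle.unbound());
    System.out.println(outer.name + " sets " + outer.variablesSet
        + ", unbound " + outer.unbound());
  }
}
```

Here middleFilter inherits cor0 from outerFilter, so cor0 does not belong in its variables set, and the subtree rooted at middleFilter is not self-contained because cor0 is still unbound there. The subtree rooted at outerFilter has no unbound ids, which in this model is the condition under which decorrelating just that section would be reasonable.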