You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Ted Xu <fr...@gmail.com> on 2018/10/23 03:26:15 UTC

'Spool' Node support

Hi folks,

I'm not sure if there is a recommended way to represent diverged (multiple
parents) plan in Calcite. It’s true that RelNode data structure is
compatible with multiple parents, but it is not working in optimizer.

For example, if we have query as follows,

FROM (SELECT c1, random() as c2, c3 FROM src)
INSERT OVERWRITE TABLE src1 SELECT c1, c2
INSERT OVERWRITE TABLE src2 SELECT c3, c2

TableSink1(on columns c1, c2)
    Project(c1, random() as c2, c3)
        TableScan
TableSink2(on columns c3, c2)
    Project(c1, random() as c2, c3)
        TableScan

Planners will recognize Projects and TableScans share the common digests
thus merged together, but Project Transpose Rules splits them, which breaks
the random assumption.

My solution is to add a Spool node to prevent any rule to further split a
sub-plan, but it generates sub-optimal result. I've noticed there is a
really old JIRA ticket https://jira.apache.org/jira/browse/CALCITE-481 but
it was somehow suspended.

I'd like to move on on this feature, but there are still something to do
first:

1. Let RelOptRuleCall to aware parents, currently only HepRelOptRuleCall
passes parents in certain cases.
2. Let RelOptRuleOperand to define multiple parent patterns

Please correct me if I'm something wrong, any suggestion will be much
appreciated.

Re: 'Spool' Node support

Posted by Julian Hyde <jh...@gmail.com>.
I get the impression you’re missing something important about how VolcanoPlanner works - equivalence sets. You speak of detecting duplicates by looking at RelNodes’ digests, whereas you should look at whether they are in the same equivalence set. 

And you also suggest that the previous plan is thrown away, and in VolcanoPlanner that is not the case. 

To take your project-project example. After merging projects there will be a choice of inputs - the combined project on top of a scan, and a project on top of the spool. They are equivalent. The project on top of the spool is likely the best plan - because it shares work with the other node that uses the spool - but it’s possible that the combined project prunes columns so aggressively that it’s better. 

So yes, Spool node would prevent plans diverging, but it’s really equivalence sets that are doing the work. Spool mainly helps with accounting.

Julian

> On Oct 22, 2018, at 8:46 PM, Ted Xu <fr...@gmail.com> wrote:
> 
> Thanks Julian for the comment.
> 
> To some extent it is "split" in VolcanoPlanner. Let's revisit the example I
> mentioned above:
> 
> TableSink1(on columns c1, c2)
>    Project(c1, random() as c2, c3)
>        TableScan
> TableSink2(on columns c3, c2)
>    Project(c1, random() as c2, c3)
>        TableScan
> 
> The 2 projects share the common digests, thus it is recognized in
> VolcanoPlanner, the plan in memo looks like:
> 
> TableSink1     TableSink2
>         \                 /
>             Project (c1, random() as c2, c3)
>                 |
>            TableScan
> 
> However, if we "add" a project here, the project will further match other
> rule pattern (e.g., PrjectMergeRule), the random project will no longer
> share the same digest, new plan looks like:
> 
> TableSink1              TableSink2
>      |                               |
> Project (c1, random())   Project(c3, random())
>      |                               |
> TableScan(c1)         TableScan(c3)
> 
> Notice that the random function call is executed twice, which breaks the
> assumption of original sql query.
> 
> I think Spool node should prevent such kind of transformation.
> 
>> On Tue, Oct 23, 2018 at 11:30 AM Julian Hyde <jh...@apache.org> wrote:
>> 
>> I assume you’re talking about HepPlanner? VolcanoPlanner doesn’t “split”
>> anything, it only adds new things.
>> 
>> As you’ve noticed Spool isn’t finished, but the idea would be to use
>> VolcanoPlanner, because it can truly handle plans that are DAGs, then use
>> some kind of costing trick to ensure that nodes that are shared are only
>> counted in the overall cost once.
>> 
>>> On Oct 22, 2018, at 8:26 PM, Ted Xu <fr...@gmail.com> wrote:
>>> 
>>> Hi folks,
>>> 
>>> I'm not sure if there is a recommended way to represent diverged
>> (multiple
>>> parents) plan in Calcite. It’s true that RelNode data structure is
>>> compatible with multiple parents, but it is not working in optimizer.
>>> 
>>> For example, if we have query as follows,
>>> 
>>> FROM (SELECT c1, random() as c2, c3 FROM src)
>>> INSERT OVERWRITE TABLE src1 SELECT c1, c2
>>> INSERT OVERWRITE TABLE src2 SELECT c3, c2
>>> 
>>> TableSink1(on columns c1, c2)
>>>   Project(c1, random() as c2, c3)
>>>       TableScan
>>> TableSink2(on columns c3, c2)
>>>   Project(c1, random() as c2, c3)
>>>       TableScan
>>> 
>>> Planners will recognize Projects and TableScans share the common digests
>>> thus merged together, but Project Transpose Rules splits them, which
>> breaks
>>> the random assumption.
>>> 
>>> My solution is to add a Spool node to prevent any rule to further split a
>>> sub-plan, but it generates sub-optimal result. I've noticed there is a
>>> really old JIRA ticket https://jira.apache.org/jira/browse/CALCITE-481
>> but
>>> it was somehow suspended.
>>> 
>>> I'd like to move on on this feature, but there are still something to do
>>> first:
>>> 
>>> 1. Let RelOptRuleCall to aware parents, currently only HepRelOptRuleCall
>>> passes parents in certain cases.
>>> 2. Let RelOptRuleOperand to define multiple parent patterns
>>> 
>>> Please correct me if I'm something wrong, any suggestion will be much
>>> appreciated.
>> 
>> 

Re: 'Spool' Node support

Posted by Ted Xu <fr...@gmail.com>.
Thanks Julian for the comment.

To some extent it is "split" in VolcanoPlanner. Let's revisit the example I
mentioned above:

TableSink1(on columns c1, c2)
    Project(c1, random() as c2, c3)
        TableScan
TableSink2(on columns c3, c2)
    Project(c1, random() as c2, c3)
        TableScan

The 2 projects share the common digests, thus it is recognized in
VolcanoPlanner, the plan in memo looks like:

TableSink1     TableSink2
         \                 /
             Project (c1, random() as c2, c3)
                 |
            TableScan

However, if we "add" a project here, the project will further match other
rule pattern (e.g., PrjectMergeRule), the random project will no longer
share the same digest, new plan looks like:

TableSink1              TableSink2
      |                               |
Project (c1, random())   Project(c3, random())
      |                               |
TableScan(c1)         TableScan(c3)

Notice that the random function call is executed twice, which breaks the
assumption of original sql query.

I think Spool node should prevent such kind of transformation.

On Tue, Oct 23, 2018 at 11:30 AM Julian Hyde <jh...@apache.org> wrote:

> I assume you’re talking about HepPlanner? VolcanoPlanner doesn’t “split”
> anything, it only adds new things.
>
> As you’ve noticed Spool isn’t finished, but the idea would be to use
> VolcanoPlanner, because it can truly handle plans that are DAGs, then use
> some kind of costing trick to ensure that nodes that are shared are only
> counted in the overall cost once.
>
> > On Oct 22, 2018, at 8:26 PM, Ted Xu <fr...@gmail.com> wrote:
> >
> > Hi folks,
> >
> > I'm not sure if there is a recommended way to represent diverged
> (multiple
> > parents) plan in Calcite. It’s true that RelNode data structure is
> > compatible with multiple parents, but it is not working in optimizer.
> >
> > For example, if we have query as follows,
> >
> > FROM (SELECT c1, random() as c2, c3 FROM src)
> > INSERT OVERWRITE TABLE src1 SELECT c1, c2
> > INSERT OVERWRITE TABLE src2 SELECT c3, c2
> >
> > TableSink1(on columns c1, c2)
> >    Project(c1, random() as c2, c3)
> >        TableScan
> > TableSink2(on columns c3, c2)
> >    Project(c1, random() as c2, c3)
> >        TableScan
> >
> > Planners will recognize Projects and TableScans share the common digests
> > thus merged together, but Project Transpose Rules splits them, which
> breaks
> > the random assumption.
> >
> > My solution is to add a Spool node to prevent any rule to further split a
> > sub-plan, but it generates sub-optimal result. I've noticed there is a
> > really old JIRA ticket https://jira.apache.org/jira/browse/CALCITE-481
> but
> > it was somehow suspended.
> >
> > I'd like to move on on this feature, but there are still something to do
> > first:
> >
> > 1. Let RelOptRuleCall to aware parents, currently only HepRelOptRuleCall
> > passes parents in certain cases.
> > 2. Let RelOptRuleOperand to define multiple parent patterns
> >
> > Please correct me if I'm something wrong, any suggestion will be much
> > appreciated.
>
>

Re: 'Spool' Node support

Posted by Julian Hyde <jh...@apache.org>.
I assume you’re talking about HepPlanner? VolcanoPlanner doesn’t “split” anything, it only adds new things.

As you’ve noticed Spool isn’t finished, but the idea would be to use VolcanoPlanner, because it can truly handle plans that are DAGs, then use some kind of costing trick to ensure that nodes that are shared are only counted in the overall cost once.

> On Oct 22, 2018, at 8:26 PM, Ted Xu <fr...@gmail.com> wrote:
> 
> Hi folks,
> 
> I'm not sure if there is a recommended way to represent diverged (multiple
> parents) plan in Calcite. It’s true that RelNode data structure is
> compatible with multiple parents, but it is not working in optimizer.
> 
> For example, if we have query as follows,
> 
> FROM (SELECT c1, random() as c2, c3 FROM src)
> INSERT OVERWRITE TABLE src1 SELECT c1, c2
> INSERT OVERWRITE TABLE src2 SELECT c3, c2
> 
> TableSink1(on columns c1, c2)
>    Project(c1, random() as c2, c3)
>        TableScan
> TableSink2(on columns c3, c2)
>    Project(c1, random() as c2, c3)
>        TableScan
> 
> Planners will recognize Projects and TableScans share the common digests
> thus merged together, but Project Transpose Rules splits them, which breaks
> the random assumption.
> 
> My solution is to add a Spool node to prevent any rule to further split a
> sub-plan, but it generates sub-optimal result. I've noticed there is a
> really old JIRA ticket https://jira.apache.org/jira/browse/CALCITE-481 but
> it was somehow suspended.
> 
> I'd like to move on on this feature, but there are still something to do
> first:
> 
> 1. Let RelOptRuleCall to aware parents, currently only HepRelOptRuleCall
> passes parents in certain cases.
> 2. Let RelOptRuleOperand to define multiple parent patterns
> 
> Please correct me if I'm something wrong, any suggestion will be much
> appreciated.