You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Chinmay Kolhatkar <ch...@datatorrent.com> on 2016/10/04 16:23:19 UTC

Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Dear Community,

I'm working on integration with Apache Apex and Apache Calcite (underlying
Apache Jira is: APEXMALHAR-1818).

I have a question related to conversion from SQL to RelNode Tree.

Is it possible that VolcanoPlanner can take multiple SQL statements as
input and return a unified RelNode Tree?
Example of above is as follows:

SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';

Above 2 statements has a common path and hence can provide an unified
RelNode tree as follows:

 [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
                    |
                    V
            [Filter (COL3 > 10)]
                    |
                    v
                 [Delta]


Thanks,
Chinmay.

Re: Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Posted by Julian Hyde <jh...@gmail.com>.
Thanks; I'll add comments to the jira case. 

Julian

> On Oct 13, 2016, at 9:39 PM, Chinmay Kolhatkar <ch...@apache.org> wrote:
> 
> Apologies for delay on this.
> I've logged a JIRA for this:
> https://issues.apache.org/jira/browse/CALCITE-1440
> 
> 
>> On Mon, Oct 10, 2016 at 11:13 PM, Julian Hyde <jh...@apache.org> wrote:
>> 
>> It sounds as if your requirements go beyond CALCITE-481. If so, please log
>> a JIRA case.
>> 
>>> On Oct 10, 2016, at 8:52 AM, Chinmay Kolhatkar <ch...@apache.org>
>> wrote:
>>> 
>>> Thanks Julian. I think CALCITE-481 would be a good improvement.
>>> 
>>> On Wed, Oct 5, 2016 at 11:02 PM, Julian Hyde <jhyde.apache@gmail.com
>> <ma...@gmail.com>> wrote:
>>> 
>>>> What you are asking for is not currently possible but would be a very
>>>> interesting, useful and powerful extension to Calcite. And in my
>> opinion it
>>>> fits well with the way with the way that Volcano/Cascades optimizers
>> work.
>>>> The idea would be for the optimizer to work on a forest (a set of trees
>>>> with different roots, but shared leaves).
>>>> 
>>>> It is related to https://issues.apache.org/jira/browse/CALCITE-481,
>> which
>>>> deals with trees that have a single root but store intermediate results
>> so
>>>> that they can be used more than once. I surmise that you are also very
>>>> interested in maximizing the commonality between trees.
>>>> 
>>>> The trickiest part is to get the cost model right. You need to account
>> for
>>>> a cost each time the work is done, not each time the result is used.
>> Jesus
>>>> researched this
>>>> very topic, using integer linear programming (ILP) to incorporate these
>>>> considerations into the cost model.
>>>> 
>>>> Can you please create a JIRA case for it? Maybe Jesus can add some
>>>> comments there.
>>>> 
>>>> It is also related to Hive multi-table queries [1]. I have not seen a
>> SQL
>>>> syntax for multi-output queries. Maybe we can devise one. Or does
>> someone
>>>> know of one?
>>>> 
>>>> And it presents challenges/opportunities for engines. Some engines can
>>>> push to multiple consumers, or equivalently create a stream of tuples
>> that
>>>> can be read by multiple consumers, others can only persist an
>> intermediate
>>>> result that can be read by multiple consumers. Spark had their own
>> version
>>>> of this discussion when implementing multitable in Hive-Spark[2].
>>>> 
>>>> Julian
>>>> 
>>>> [1] https://cwiki.apache.org/confluence/display/Hive/
>>>> GettingStarted#GettingStarted-MULTITABLEINSERT <
>> https://cwiki.apache.org/ <https://cwiki.apache.org/>
>>>> confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT>
>>>> 
>>>> [2] https://issues.apache.org/jira/browse/SPARK-3622 <
>> https://issues.apache.org/jira/browse/SPARK-3622> <
>>>> https://issues.apache.org/jira/browse/SPARK-3622>
>>>> 
>>>>> On Oct 4, 2016, at 9:23 AM, Chinmay Kolhatkar <chinmay@datatorrent.com
>>> 
>>>> wrote:
>>>>> 
>>>>> Dear Community,
>>>>> 
>>>>> I'm working on integration with Apache Apex and Apache Calcite
>>>> (underlying
>>>>> Apache Jira is: APEXMALHAR-1818).
>>>>> 
>>>>> I have a question related to conversion from SQL to RelNode Tree.
>>>>> 
>>>>> Is it possible that VolcanoPlanner can take multiple SQL statements as
>>>>> input and return a unified RelNode Tree?
>>>>> Example of above is as follows:
>>>>> 
>>>>> SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
>>>>> SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';
>>>>> 
>>>>> Above 2 statements has a common path and hence can provide an unified
>>>>> RelNode tree as follows:
>>>>> 
>>>>> [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
>>>>>                  |
>>>>>                  V
>>>>>          [Filter (COL3 > 10)]
>>>>>                  |
>>>>>                  v
>>>>>               [Delta]
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Chinmay.
>> 
>> 

Re: Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Posted by Chinmay Kolhatkar <ch...@apache.org>.
Apologies for delay on this.
I've logged a JIRA for this:
https://issues.apache.org/jira/browse/CALCITE-1440


On Mon, Oct 10, 2016 at 11:13 PM, Julian Hyde <jh...@apache.org> wrote:

> It sounds as if your requirements go beyond CALCITE-481. If so, please log
> a JIRA case.
>
> > On Oct 10, 2016, at 8:52 AM, Chinmay Kolhatkar <ch...@apache.org>
> wrote:
> >
> > Thanks Julian. I think CALCITE-481 would be a good improvement.
> >
> > On Wed, Oct 5, 2016 at 11:02 PM, Julian Hyde <jhyde.apache@gmail.com
> <ma...@gmail.com>> wrote:
> >
> >> What you are asking for is not currently possible but would be a very
> >> interesting, useful and powerful extension to Calcite. And in my
> opinion it
> >> fits well with the way with the way that Volcano/Cascades optimizers
> work.
> >> The idea would be for the optimizer to work on a forest (a set of trees
> >> with different roots, but shared leaves).
> >>
> >> It is related to https://issues.apache.org/jira/browse/CALCITE-481,
> which
> >> deals with trees that have a single root but store intermediate results
> so
> >> that they can be used more than once. I surmise that you are also very
> >> interested in maximizing the commonality between trees.
> >>
> >> The trickiest part is to get the cost model right. You need to account
> for
> >> a cost each time the work is done, not each time the result is used.
> Jesus
> >> researched this
> >> very topic, using integer linear programming (ILP) to incorporate these
> >> considerations into the cost model.
> >>
> >> Can you please create a JIRA case for it? Maybe Jesus can add some
> >> comments there.
> >>
> >> It is also related to Hive multi-table queries [1]. I have not seen a
> SQL
> >> syntax for multi-output queries. Maybe we can devise one. Or does
> someone
> >> know of one?
> >>
> >> And it presents challenges/opportunities for engines. Some engines can
> >> push to multiple consumers, or equivalently create a stream of tuples
> that
> >> can be read by multiple consumers, others can only persist an
> intermediate
> >> result that can be read by multiple consumers. Spark had their own
> version
> >> of this discussion when implementing multitable in Hive-Spark[2].
> >>
> >> Julian
> >>
> >> [1] https://cwiki.apache.org/confluence/display/Hive/
> >> GettingStarted#GettingStarted-MULTITABLEINSERT <
> https://cwiki.apache.org/ <https://cwiki.apache.org/>
> >> confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT>
> >>
> >> [2] https://issues.apache.org/jira/browse/SPARK-3622 <
> https://issues.apache.org/jira/browse/SPARK-3622> <
> >> https://issues.apache.org/jira/browse/SPARK-3622>
> >>
> >>> On Oct 4, 2016, at 9:23 AM, Chinmay Kolhatkar <chinmay@datatorrent.com
> >
> >> wrote:
> >>>
> >>> Dear Community,
> >>>
> >>> I'm working on integration with Apache Apex and Apache Calcite
> >> (underlying
> >>> Apache Jira is: APEXMALHAR-1818).
> >>>
> >>> I have a question related to conversion from SQL to RelNode Tree.
> >>>
> >>> Is it possible that VolcanoPlanner can take multiple SQL statements as
> >>> input and return a unified RelNode Tree?
> >>> Example of above is as follows:
> >>>
> >>> SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
> >>> SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';
> >>>
> >>> Above 2 statements has a common path and hence can provide an unified
> >>> RelNode tree as follows:
> >>>
> >>> [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
> >>>                   |
> >>>                   V
> >>>           [Filter (COL3 > 10)]
> >>>                   |
> >>>                   v
> >>>                [Delta]
> >>>
> >>>
> >>> Thanks,
> >>> Chinmay.
>
>

Re: Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Posted by Julian Hyde <jh...@apache.org>.
It sounds as if your requirements go beyond CALCITE-481. If so, please log a JIRA case.

> On Oct 10, 2016, at 8:52 AM, Chinmay Kolhatkar <ch...@apache.org> wrote:
> 
> Thanks Julian. I think CALCITE-481 would be a good improvement.
> 
> On Wed, Oct 5, 2016 at 11:02 PM, Julian Hyde <jhyde.apache@gmail.com <ma...@gmail.com>> wrote:
> 
>> What you are asking for is not currently possible but would be a very
>> interesting, useful and powerful extension to Calcite. And in my opinion it
>> fits well with the way with the way that Volcano/Cascades optimizers work.
>> The idea would be for the optimizer to work on a forest (a set of trees
>> with different roots, but shared leaves).
>> 
>> It is related to https://issues.apache.org/jira/browse/CALCITE-481, which
>> deals with trees that have a single root but store intermediate results so
>> that they can be used more than once. I surmise that you are also very
>> interested in maximizing the commonality between trees.
>> 
>> The trickiest part is to get the cost model right. You need to account for
>> a cost each time the work is done, not each time the result is used. Jesus
>> researched this
>> very topic, using integer linear programming (ILP) to incorporate these
>> considerations into the cost model.
>> 
>> Can you please create a JIRA case for it? Maybe Jesus can add some
>> comments there.
>> 
>> It is also related to Hive multi-table queries [1]. I have not seen a SQL
>> syntax for multi-output queries. Maybe we can devise one. Or does someone
>> know of one?
>> 
>> And it presents challenges/opportunities for engines. Some engines can
>> push to multiple consumers, or equivalently create a stream of tuples that
>> can be read by multiple consumers, others can only persist an intermediate
>> result that can be read by multiple consumers. Spark had their own version
>> of this discussion when implementing multitable in Hive-Spark[2].
>> 
>> Julian
>> 
>> [1] https://cwiki.apache.org/confluence/display/Hive/
>> GettingStarted#GettingStarted-MULTITABLEINSERT <https://cwiki.apache.org/ <https://cwiki.apache.org/>
>> confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT>
>> 
>> [2] https://issues.apache.org/jira/browse/SPARK-3622 <https://issues.apache.org/jira/browse/SPARK-3622> <
>> https://issues.apache.org/jira/browse/SPARK-3622>
>> 
>>> On Oct 4, 2016, at 9:23 AM, Chinmay Kolhatkar <ch...@datatorrent.com>
>> wrote:
>>> 
>>> Dear Community,
>>> 
>>> I'm working on integration with Apache Apex and Apache Calcite
>> (underlying
>>> Apache Jira is: APEXMALHAR-1818).
>>> 
>>> I have a question related to conversion from SQL to RelNode Tree.
>>> 
>>> Is it possible that VolcanoPlanner can take multiple SQL statements as
>>> input and return a unified RelNode Tree?
>>> Example of above is as follows:
>>> 
>>> SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
>>> SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';
>>> 
>>> Above 2 statements has a common path and hence can provide an unified
>>> RelNode tree as follows:
>>> 
>>> [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
>>>                   |
>>>                   V
>>>           [Filter (COL3 > 10)]
>>>                   |
>>>                   v
>>>                [Delta]
>>> 
>>> 
>>> Thanks,
>>> Chinmay.


Re: Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Posted by Chinmay Kolhatkar <ch...@apache.org>.
Thanks Julian. I think CALCITE-481 would be a good improvement.

On Wed, Oct 5, 2016 at 11:02 PM, Julian Hyde <jh...@gmail.com> wrote:

> What you are asking for is not currently possible but would be a very
> interesting, useful and powerful extension to Calcite. And in my opinion it
> fits well with the way with the way that Volcano/Cascades optimizers work.
> The idea would be for the optimizer to work on a forest (a set of trees
> with different roots, but shared leaves).
>
> It is related to https://issues.apache.org/jira/browse/CALCITE-481, which
> deals with trees that have a single root but store intermediate results so
> that they can be used more than once. I surmise that you are also very
> interested in maximizing the commonality between trees.
>
> The trickiest part is to get the cost model right. You need to account for
> a cost each time the work is done, not each time the result is used. Jesus
> researched this
> very topic, using integer linear programming (ILP) to incorporate these
> considerations into the cost model.
>
> Can you please create a JIRA case for it? Maybe Jesus can add some
> comments there.
>
> It is also related to Hive multi-table queries [1]. I have not seen a SQL
> syntax for multi-output queries. Maybe we can devise one. Or does someone
> know of one?
>
> And it presents challenges/opportunities for engines. Some engines can
> push to multiple consumers, or equivalently create a stream of tuples that
> can be read by multiple consumers, others can only persist an intermediate
> result that can be read by multiple consumers. Spark had their own version
> of this discussion when implementing multitable in Hive-Spark[2].
>
> Julian
>
> [1] https://cwiki.apache.org/confluence/display/Hive/
> GettingStarted#GettingStarted-MULTITABLEINSERT <https://cwiki.apache.org/
> confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT>
>
> [2] https://issues.apache.org/jira/browse/SPARK-3622 <
> https://issues.apache.org/jira/browse/SPARK-3622>
>
> > On Oct 4, 2016, at 9:23 AM, Chinmay Kolhatkar <ch...@datatorrent.com>
> wrote:
> >
> > Dear Community,
> >
> > I'm working on integration with Apache Apex and Apache Calcite
> (underlying
> > Apache Jira is: APEXMALHAR-1818).
> >
> > I have a question related to conversion from SQL to RelNode Tree.
> >
> > Is it possible that VolcanoPlanner can take multiple SQL statements as
> > input and return a unified RelNode Tree?
> > Example of above is as follows:
> >
> > SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
> > SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';
> >
> > Above 2 statements has a common path and hence can provide an unified
> > RelNode tree as follows:
> >
> > [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
> >                    |
> >                    V
> >            [Filter (COL3 > 10)]
> >                    |
> >                    v
> >                 [Delta]
> >
> >
> > Thanks,
> > Chinmay.
>
>

Re: Using VolcanoPlanner to create unified RelNode tree from multiple SQL

Posted by Julian Hyde <jh...@gmail.com>.
What you are asking for is not currently possible but would be a very interesting, useful and powerful extension to Calcite. And in my opinion it fits well with the way with the way that Volcano/Cascades optimizers work. The idea would be for the optimizer to work on a forest (a set of trees with different roots, but shared leaves).

It is related to https://issues.apache.org/jira/browse/CALCITE-481, which deals with trees that have a single root but store intermediate results so that they can be used more than once. I surmise that you are also very interested in maximizing the commonality between trees.

The trickiest part is to get the cost model right. You need to account for a cost each time the work is done, not each time the result is used. Jesus researched this 
very topic, using integer linear programming (ILP) to incorporate these considerations into the cost model.

Can you please create a JIRA case for it? Maybe Jesus can add some comments there.

It is also related to Hive multi-table queries [1]. I have not seen a SQL syntax for multi-output queries. Maybe we can devise one. Or does someone know of one?

And it presents challenges/opportunities for engines. Some engines can push to multiple consumers, or equivalently create a stream of tuples that can be read by multiple consumers, others can only persist an intermediate result that can be read by multiple consumers. Spark had their own version of this discussion when implementing multitable in Hive-Spark[2].

Julian

[1] https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT <https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT> 

[2] https://issues.apache.org/jira/browse/SPARK-3622 <https://issues.apache.org/jira/browse/SPARK-3622> 

> On Oct 4, 2016, at 9:23 AM, Chinmay Kolhatkar <ch...@datatorrent.com> wrote:
> 
> Dear Community,
> 
> I'm working on integration with Apache Apex and Apache Calcite (underlying
> Apache Jira is: APEXMALHAR-1818).
> 
> I have a question related to conversion from SQL to RelNode Tree.
> 
> Is it possible that VolcanoPlanner can take multiple SQL statements as
> input and return a unified RelNode Tree?
> Example of above is as follows:
> 
> SELECT COL1, COL2 FROM TABLE WHERE COL3 > 10;
> SELECT COL1, COL2 FROM TABLE WHERE COL4  = 'abc';
> 
> Above 2 statements has a common path and hence can provide an unified
> RelNode tree as follows:
> 
> [Scan] -> [Project (COL1, COL2)] -> [Filter (COL4 = 'abc')] -> [Delta]
>                    |
>                    V
>            [Filter (COL3 > 10)]
>                    |
>                    v
>                 [Delta]
> 
> 
> Thanks,
> Chinmay.