You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Ozerov <pp...@gmail.com> on 2022/02/10 16:50:43 UTC

Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Hi,

In the Cascades driver, it is possible to propagate the requests top-down
using the "passThrough", method and then notify parents bottom-up about the
concrete physical implementations of inputs using the "derive" method.

In some optimizers, the valid parent node cannot be created before the
trait sets of inputs are known. An example is a custom distribution trait
that includes the number of shards in the system. The parent operator alone
may guess the distribution keys, but cannot know the number of input
shards. To mitigate this, you may create a "template" node with an infinite
cost from within the optimization rule that will propagate the
passThrough/drive calls but would never participate in the final plan.

Currency, the top-down driver designed in a way that the nodes created from
the "passThrough" method are not notified on the "derive" stage. This leads
to the incomplete exploration of the search space. For example, the rule
may produce the node "A1.template" that will be converted into a normal
"A1" node in the derive phase. However, if the parent operator produced
"A2.template" from "A1.template" using pass-through mechanics, the
"A2.template" will never be notified about the concrete input traits,
possibly losing the optimal plan. This is especially painful in distributed
engines, where the number of shards is important for the placement of
Shuffle operators.

It seems that the problem could be solved with relatively low effort. The
"derive" is not invoked on the nodes created from the "passThrough" method,
because such nodes are placed in the "passThroughCache" collection. Instead
of doing this unconditionally, we may introduce an additional predicate
that would selectively enforce "derive" on such nodes. For example, this
could be a default method in the PhysicalNode interface, like:

interface PhysicalNode {
  default boolean enforceDerive() { return false; }
}

If there are no objections, I'll proceed with this change.

Alternatively, we may make the TopDownRuleDriver more "public", so that the
user can extend it and decide within the driver whether to cache a
particular node or not.

I would appreciate your feedback on the matter.

Regards,
Vladimir.

Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Posted by Haisheng Yuan <hy...@apache.org>.
Hi Vladimir,

If I understand it correctly, your concern is still about [1], right?
If it is, I think I have answered it in [2], perhaps it didn't jump into your inbox.

I am still not convinced that enforcing derivation on all rels is the only way to solve your concern.

If it is the other problem, can you give us a concrete example? So we can discuss and solve it.

Thanks,
Haisheng Yuan

[1] https://lists.apache.org/thread/4rcvlk1oprbnbgbnwl2s735p99h4vj80
[2] https://lists.apache.org/thread/s17tsj3pmccfrfvydnmv76btc3voxohj

On 2022/02/13 09:40:05 Roman Kondakov wrote:
> Hi Alessandro,
> 
> this problem was already discussed on dev-list [1] and we have a ticket 
> for this [2].
> 
> My concern is that many projects use Calcite as a Lego kit: they took 
> internal components of Calcite and combine them for building a custom 
> planning and execution pipeline. And sometimes downstream projects need 
> to change the default behavior of internal components to fit their 
> requirements or overcome the bug. So the idea of keeping even internal 
> components of Calcite "more public" is rather a good thing than the bad 
> one from my point of view.
> 
> Thank you.
> 
> [1] https://lists.apache.org/thread/cykl74dcphgow4790fwoc8frsjglz7n1
> 
> [2] https://issues.apache.org/jira/browse/CALCITE-4542
> 
> --
> Roman Kondakov
> 
> 
> On 11.02.2022 19:15, Alessandro Solimando wrote:
> > Hello everyone,
> > @Vladimir, +1 on the change introducing "enforceDerive()".
> >
> > @Roman, could you walk us through the limitations you found that forced you
> > to copy-paste the whole class?
> >
> > Maybe there is some middle ground for your problem(s) too, similar in
> > spirit to what Vladimir proposed for the other limitation.
> >
> > I am not against making the class more public if necessary, but it would be
> > nice to have a discussion here before going down that path.
> > If the discussion leads to a better design of the original class, all
> > projects would benefit from that.
> >
> > Best regards,
> > Alessandro
> >
> > On Fri, 11 Feb 2022 at 04:14, Roman Kondakov <ko...@mail.ru.invalid>
> > wrote:
> >
> >> Hi Vladimir,
> >>
> >> +1 for making the rule driver more public. We've faced similar problems
> >> in the downstream project. The solution was to copy and paste the
> >> TopDownRuleDrive code with small fixes since it was not possible to
> >> override the default behavior.
> >>
> >> --
> >> Roman Kondakov
> >>
> >>
> >> On 11.02.2022 02:50, Vladimir Ozerov wrote:
> >>> Hi,
> >>>
> >>> In the Cascades driver, it is possible to propagate the requests top-down
> >>> using the "passThrough", method and then notify parents bottom-up about
> >> the
> >>> concrete physical implementations of inputs using the "derive" method.
> >>>
> >>> In some optimizers, the valid parent node cannot be created before the
> >>> trait sets of inputs are known. An example is a custom distribution trait
> >>> that includes the number of shards in the system. The parent operator
> >> alone
> >>> may guess the distribution keys, but cannot know the number of input
> >>> shards. To mitigate this, you may create a "template" node with an
> >> infinite
> >>> cost from within the optimization rule that will propagate the
> >>> passThrough/drive calls but would never participate in the final plan.
> >>>
> >>> Currency, the top-down driver designed in a way that the nodes created
> >> from
> >>> the "passThrough" method are not notified on the "derive" stage. This
> >> leads
> >>> to the incomplete exploration of the search space. For example, the rule
> >>> may produce the node "A1.template" that will be converted into a normal
> >>> "A1" node in the derive phase. However, if the parent operator produced
> >>> "A2.template" from "A1.template" using pass-through mechanics, the
> >>> "A2.template" will never be notified about the concrete input traits,
> >>> possibly losing the optimal plan. This is especially painful in
> >> distributed
> >>> engines, where the number of shards is important for the placement of
> >>> Shuffle operators.
> >>>
> >>> It seems that the problem could be solved with relatively low effort. The
> >>> "derive" is not invoked on the nodes created from the "passThrough"
> >> method,
> >>> because such nodes are placed in the "passThroughCache" collection.
> >> Instead
> >>> of doing this unconditionally, we may introduce an additional predicate
> >>> that would selectively enforce "derive" on such nodes. For example, this
> >>> could be a default method in the PhysicalNode interface, like:
> >>>
> >>> interface PhysicalNode {
> >>>     default boolean enforceDerive() { return false; }
> >>> }
> >>>
> >>> If there are no objections, I'll proceed with this change.
> >>>
> >>> Alternatively, we may make the TopDownRuleDriver more "public", so that
> >> the
> >>> user can extend it and decide within the driver whether to cache a
> >>> particular node or not.
> >>>
> >>> I would appreciate your feedback on the matter.
> >>>
> >>> Regards,
> >>> Vladimir.
> >>>
> 

Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Posted by Julian Hyde <jh...@gmail.com>.
> So the idea of keeping even internal components of Calcite "more public" is rather a good thing than the bad one from my point of view.

This can go two ways.

There is the type of change that I call “drill a hole” where someone requests that an implementation detail be made public so that they can use it, but they don’t provide much justification, and don’t provide a use case that might be useful to others.

The other is where someone converts the internal mechanism into a genuine feature, with a use case, documentation, and comprehensive tests that help others understand and use the feature.

The “drill a hole” changes are a net burden on the Calcite project because we have to maintain the implementation going forward.

I’m not going to make a judgment about whether this proposed change is positive. The key thing is to think of it as adding a feature, not just adding the word “public” in one or two source files.

Julian


Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Posted by Roman Kondakov <ko...@mail.ru.INVALID>.
Hi Alessandro,

this problem was already discussed on dev-list [1] and we have a ticket 
for this [2].

My concern is that many projects use Calcite as a Lego kit: they took 
internal components of Calcite and combine them for building a custom 
planning and execution pipeline. And sometimes downstream projects need 
to change the default behavior of internal components to fit their 
requirements or overcome the bug. So the idea of keeping even internal 
components of Calcite "more public" is rather a good thing than the bad 
one from my point of view.

Thank you.

[1] https://lists.apache.org/thread/cykl74dcphgow4790fwoc8frsjglz7n1

[2] https://issues.apache.org/jira/browse/CALCITE-4542

--
Roman Kondakov


On 11.02.2022 19:15, Alessandro Solimando wrote:
> Hello everyone,
> @Vladimir, +1 on the change introducing "enforceDerive()".
>
> @Roman, could you walk us through the limitations you found that forced you
> to copy-paste the whole class?
>
> Maybe there is some middle ground for your problem(s) too, similar in
> spirit to what Vladimir proposed for the other limitation.
>
> I am not against making the class more public if necessary, but it would be
> nice to have a discussion here before going down that path.
> If the discussion leads to a better design of the original class, all
> projects would benefit from that.
>
> Best regards,
> Alessandro
>
> On Fri, 11 Feb 2022 at 04:14, Roman Kondakov <ko...@mail.ru.invalid>
> wrote:
>
>> Hi Vladimir,
>>
>> +1 for making the rule driver more public. We've faced similar problems
>> in the downstream project. The solution was to copy and paste the
>> TopDownRuleDrive code with small fixes since it was not possible to
>> override the default behavior.
>>
>> --
>> Roman Kondakov
>>
>>
>> On 11.02.2022 02:50, Vladimir Ozerov wrote:
>>> Hi,
>>>
>>> In the Cascades driver, it is possible to propagate the requests top-down
>>> using the "passThrough", method and then notify parents bottom-up about
>> the
>>> concrete physical implementations of inputs using the "derive" method.
>>>
>>> In some optimizers, the valid parent node cannot be created before the
>>> trait sets of inputs are known. An example is a custom distribution trait
>>> that includes the number of shards in the system. The parent operator
>> alone
>>> may guess the distribution keys, but cannot know the number of input
>>> shards. To mitigate this, you may create a "template" node with an
>> infinite
>>> cost from within the optimization rule that will propagate the
>>> passThrough/drive calls but would never participate in the final plan.
>>>
>>> Currency, the top-down driver designed in a way that the nodes created
>> from
>>> the "passThrough" method are not notified on the "derive" stage. This
>> leads
>>> to the incomplete exploration of the search space. For example, the rule
>>> may produce the node "A1.template" that will be converted into a normal
>>> "A1" node in the derive phase. However, if the parent operator produced
>>> "A2.template" from "A1.template" using pass-through mechanics, the
>>> "A2.template" will never be notified about the concrete input traits,
>>> possibly losing the optimal plan. This is especially painful in
>> distributed
>>> engines, where the number of shards is important for the placement of
>>> Shuffle operators.
>>>
>>> It seems that the problem could be solved with relatively low effort. The
>>> "derive" is not invoked on the nodes created from the "passThrough"
>> method,
>>> because such nodes are placed in the "passThroughCache" collection.
>> Instead
>>> of doing this unconditionally, we may introduce an additional predicate
>>> that would selectively enforce "derive" on such nodes. For example, this
>>> could be a default method in the PhysicalNode interface, like:
>>>
>>> interface PhysicalNode {
>>>     default boolean enforceDerive() { return false; }
>>> }
>>>
>>> If there are no objections, I'll proceed with this change.
>>>
>>> Alternatively, we may make the TopDownRuleDriver more "public", so that
>> the
>>> user can extend it and decide within the driver whether to cache a
>>> particular node or not.
>>>
>>> I would appreciate your feedback on the matter.
>>>
>>> Regards,
>>> Vladimir.
>>>

Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Posted by Alessandro Solimando <al...@gmail.com>.
Hello everyone,
@Vladimir, +1 on the change introducing "enforceDerive()".

@Roman, could you walk us through the limitations you found that forced you
to copy-paste the whole class?

Maybe there is some middle ground for your problem(s) too, similar in
spirit to what Vladimir proposed for the other limitation.

I am not against making the class more public if necessary, but it would be
nice to have a discussion here before going down that path.
If the discussion leads to a better design of the original class, all
projects would benefit from that.

Best regards,
Alessandro

On Fri, 11 Feb 2022 at 04:14, Roman Kondakov <ko...@mail.ru.invalid>
wrote:

> Hi Vladimir,
>
> +1 for making the rule driver more public. We've faced similar problems
> in the downstream project. The solution was to copy and paste the
> TopDownRuleDrive code with small fixes since it was not possible to
> override the default behavior.
>
> --
> Roman Kondakov
>
>
> On 11.02.2022 02:50, Vladimir Ozerov wrote:
> > Hi,
> >
> > In the Cascades driver, it is possible to propagate the requests top-down
> > using the "passThrough", method and then notify parents bottom-up about
> the
> > concrete physical implementations of inputs using the "derive" method.
> >
> > In some optimizers, the valid parent node cannot be created before the
> > trait sets of inputs are known. An example is a custom distribution trait
> > that includes the number of shards in the system. The parent operator
> alone
> > may guess the distribution keys, but cannot know the number of input
> > shards. To mitigate this, you may create a "template" node with an
> infinite
> > cost from within the optimization rule that will propagate the
> > passThrough/drive calls but would never participate in the final plan.
> >
> > Currency, the top-down driver designed in a way that the nodes created
> from
> > the "passThrough" method are not notified on the "derive" stage. This
> leads
> > to the incomplete exploration of the search space. For example, the rule
> > may produce the node "A1.template" that will be converted into a normal
> > "A1" node in the derive phase. However, if the parent operator produced
> > "A2.template" from "A1.template" using pass-through mechanics, the
> > "A2.template" will never be notified about the concrete input traits,
> > possibly losing the optimal plan. This is especially painful in
> distributed
> > engines, where the number of shards is important for the placement of
> > Shuffle operators.
> >
> > It seems that the problem could be solved with relatively low effort. The
> > "derive" is not invoked on the nodes created from the "passThrough"
> method,
> > because such nodes are placed in the "passThroughCache" collection.
> Instead
> > of doing this unconditionally, we may introduce an additional predicate
> > that would selectively enforce "derive" on such nodes. For example, this
> > could be a default method in the PhysicalNode interface, like:
> >
> > interface PhysicalNode {
> >    default boolean enforceDerive() { return false; }
> > }
> >
> > If there are no objections, I'll proceed with this change.
> >
> > Alternatively, we may make the TopDownRuleDriver more "public", so that
> the
> > user can extend it and decide within the driver whether to cache a
> > particular node or not.
> >
> > I would appreciate your feedback on the matter.
> >
> > Regards,
> > Vladimir.
> >
>

Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Posted by Roman Kondakov <ko...@mail.ru.INVALID>.
Hi Vladimir,

+1 for making the rule driver more public. We've faced similar problems 
in the downstream project. The solution was to copy and paste the 
TopDownRuleDrive code with small fixes since it was not possible to 
override the default behavior.

--
Roman Kondakov


On 11.02.2022 02:50, Vladimir Ozerov wrote:
> Hi,
>
> In the Cascades driver, it is possible to propagate the requests top-down
> using the "passThrough", method and then notify parents bottom-up about the
> concrete physical implementations of inputs using the "derive" method.
>
> In some optimizers, the valid parent node cannot be created before the
> trait sets of inputs are known. An example is a custom distribution trait
> that includes the number of shards in the system. The parent operator alone
> may guess the distribution keys, but cannot know the number of input
> shards. To mitigate this, you may create a "template" node with an infinite
> cost from within the optimization rule that will propagate the
> passThrough/drive calls but would never participate in the final plan.
>
> Currency, the top-down driver designed in a way that the nodes created from
> the "passThrough" method are not notified on the "derive" stage. This leads
> to the incomplete exploration of the search space. For example, the rule
> may produce the node "A1.template" that will be converted into a normal
> "A1" node in the derive phase. However, if the parent operator produced
> "A2.template" from "A1.template" using pass-through mechanics, the
> "A2.template" will never be notified about the concrete input traits,
> possibly losing the optimal plan. This is especially painful in distributed
> engines, where the number of shards is important for the placement of
> Shuffle operators.
>
> It seems that the problem could be solved with relatively low effort. The
> "derive" is not invoked on the nodes created from the "passThrough" method,
> because such nodes are placed in the "passThroughCache" collection. Instead
> of doing this unconditionally, we may introduce an additional predicate
> that would selectively enforce "derive" on such nodes. For example, this
> could be a default method in the PhysicalNode interface, like:
>
> interface PhysicalNode {
>    default boolean enforceDerive() { return false; }
> }
>
> If there are no objections, I'll proceed with this change.
>
> Alternatively, we may make the TopDownRuleDriver more "public", so that the
> user can extend it and decide within the driver whether to cache a
> particular node or not.
>
> I would appreciate your feedback on the matter.
>
> Regards,
> Vladimir.
>