Posted to dev@calcite.apache.org by Haisheng Yuan <hy...@apache.org> on 2021/04/21 01:31:19 UTC

Re: Trait propagation in heterogeneous plans

Hi Vladimir,

> There are two problems here. First, the project operator potentially
> destroys any trait which depends on column order, such as distribution or
> collation. Therefore, EnumerableProject has an incorrect value of the
> distribution trait.

The enumerable convention is intended for in-memory, non-distributed environments.
Therefore, we only consider 2 traits: collation and convention. Other traits are not
guaranteed to work correctly. If you want it to work with distribution, you have to create
your own operators and rules, either by extending or overriding the existing ones, in which case you will need
to remap distribution columns to get the correct distribution trait, just as is done for collation.
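The column remapping mentioned above can be illustrated with a small sketch. This is a toy model in plain Java, not Calcite code: the class DistributionRemap, its remap method, and the encoding of a projection as a list of input column indexes (with -1 standing for a computed expression) are all assumptions made for illustration. The idea is that a sharding key survives a projection only if the key column is forwarded unchanged, and its index must then be rewritten to the output position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model, not Calcite API: remaps sharding keys through a projection. */
final class DistributionRemap {
  private DistributionRemap() {}

  /**
   * @param inputKeys       input column indexes the data is sharded by,
   *                        e.g. [0] for SHARDED($0)
   * @param projectedInputs for each output column, the input column it forwards
   *                        unchanged, or -1 if it is a computed expression
   * @return the sharding keys expressed in output columns, or empty if any key
   *         is lost, in which case the distribution falls back to the default
   */
  static Optional<List<Integer>> remap(List<Integer> inputKeys,
                                       List<Integer> projectedInputs) {
    List<Integer> outputKeys = new ArrayList<>();
    for (int key : inputKeys) {
      // Position of the first output column that forwards this key column.
      int outputIndex = projectedInputs.indexOf(key);
      if (outputIndex < 0) {
        return Optional.empty(); // key column is not forwarded
      }
      outputKeys.add(outputIndex);
    }
    return Optional.of(outputKeys);
  }
}
```

With an input sharded by $0 and a projection that outputs [$1, $0], the remapped key is $1. If the projection drops $0 or wraps it in an expression, the sketch returns an empty result, which corresponds to falling back to the default (ANY) distribution.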

> Second, which distribution should I assign to the CustomToEnumerable node?
> As I know that parent convention cannot handle the distribution properly,
> my natural thought is to set it to ANY.

You can assume CustomToEnumerable to be an enforcer operator, like Sort or Exchange.
Sort changes only data collation; Exchange changes data distribution and collation. Similarly,
CustomToEnumerable changes only the convention but retains collation and distribution, I assume.
In practice, though, this should be decided by the operator's author and the underlying physical
implementation.

Hope that answers your question. Feel free to ask if you have more questions.

Thanks,
Haisheng Yuan

On 2021/03/27 08:43:15, Vladimir Ozerov <pp...@gmail.com> wrote: 
> Hi,
> 
> Apache Calcite supports heterogeneous optimization when nodes may have
> different conventions. The Enumerable rules propagate all traits from
> inputs. We have doubts whether this is correct or not.
> 
> Consider the following initial plan, which was created by Apache Calcite
> after sql-to-rel conversion and invocation of TranslatableTable.toRel. The
> table is in the CUSTOM convention. In this convention, there is an
> additional Distribution trait that tracks which attribute is used for
> sharding. It could be either SHARDED or ANY. The latter is the default
> distribution value which is used when the distribution is unknown. Suppose
> that the table is distributed by the attribute $0.
> LogicalProject [convention=NONE,   distribution=ANY]
>   CustomTable  [convention=CUSTOM, distribution=SHARDED($0)]
> 
> Now suppose that we run VolcanoPlanner with two rules: EnumerableProjectRule
> and converter rules that translate the CUSTOM node to ENUMERABLE node.
> First, the EnumerableProjectRule is executed. This rule propagates traits
> from the input, replacing only convention. Notice how it propagated the
> distribution trait.
> EnumerableProject [convention=ENUMERABLE, distribution=SHARDED($0)]
>   CustomTable     [convention=CUSTOM,     distribution=SHARDED($0)]
> 
> Next, the converter will be invoked, yielding the following final plan:
> EnumerableProject    [convention=ENUMERABLE, distribution=SHARDED($0)]
>   CustomToEnumerable [convention=ENUMERABLE, distribution=???]
>     CustomTable      [convention=CUSTOM,     distribution=SHARDED($0)]
> 
> There are two problems here. First, the project operator potentially
> destroys any trait which depends on column order, such as distribution or
> collation. Therefore, EnumerableProject has an incorrect value of the
> distribution trait.
> Second, which distribution should I assign to the CustomToEnumerable node?
> As I know that parent convention cannot handle the distribution properly,
> my natural thought is to set it to ANY. However, at least in the top-down
> optimizer, this will lead to CannotPlanException, unless I declare that [ANY
> satisfies SHARDED($0)], which is not the case: ANY is the unknown distribution,
> so every distribution satisfies ANY, but not vice versa.
> 
> My question is - shouldn't we ensure that only the collation trait is
> propagated from child nodes in Enumerable rules? For example, in the
> EnumerableProjectRule instead of doing:
> input.getTraitSet()
>   .replace(EnumerableConvention.INSTANCE)
>   .replace(<newCollation>)
> 
> we may do:
> RelOptCluster.traitSet()
>   .replace(EnumerableConvention.INSTANCE)
>   .replace(<newCollation>)
> 
> This would ensure that all other traits are set to the default value. The
> generalization of this idea is that every convention has a set of supported
> traits. Every unsupported trait should be set to the default value.
> 
> I would appreciate your feedback on the matter.
> 
> Regards,
> Vladimir.
> 
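The asymmetry between ANY and SHARDED($0) described in the original message can be sketched as follows. This is a toy model in plain Java, not Calcite's trait API: ToyDistribution, its sharded factory, and its satisfies method are hypothetical names, and the rule "a sharded distribution satisfies only an identical sharded requirement" is a simplifying assumption.

```java
import java.util.List;

/** Toy model, not Calcite API: a distribution is either ANY or SHARDED by keys. */
final class ToyDistribution {
  /** The default value, used when the distribution is unknown. */
  static final ToyDistribution ANY = new ToyDistribution(List.of());

  private final List<Integer> keys; // an empty list encodes ANY

  private ToyDistribution(List<Integer> keys) {
    this.keys = keys;
  }

  static ToyDistribution sharded(Integer... keys) {
    if (keys.length == 0) {
      throw new IllegalArgumentException("SHARDED needs at least one key");
    }
    return new ToyDistribution(List.of(keys));
  }

  /** True if data with this distribution meets the given requirement. */
  boolean satisfies(ToyDistribution required) {
    if (required.keys.isEmpty()) {
      return true; // requiring ANY imposes no constraint
    }
    // Simplification: a sharded requirement is met only by the same keys.
    return keys.equals(required.keys);
  }
}
```

In this model, a parent that requests SHARDED($0) from an input that reports ANY can never be matched, which mirrors the CannotPlanException scenario: ToyDistribution.ANY.satisfies(ToyDistribution.sharded(0)) is false, while the reverse direction is true.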

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Ozerov <pp...@gmail.com>.
It may propagate the in-core distribution in theory, if the relevant code
exists. Practically, there is no such code. For example, consider
EnumerableProject:

   1. EnumerableProjectRule.convert doesn't propagate input's distribution,
   thanks to EnumerableProject.create that uses RelOptCluster.traitSet.
   2. EnumerableProjectRule.derive also ignores all traits except for
   collation.

Therefore, irrespective of which trait set is present in the project's
input, the EnumerableProject will always have the default values for all
traits except for collation. This is what I refer to as "no trait
propagation". In this sense, EnumerableProject is an example of the correct
implementation wrt my proposal. But not all operators follow this, e.g.
EnumerableFilter.

On Thu, May 6, 2021 at 14:39, Vladimir Sitnikov <si...@gmail.com> wrote:

> >Enumerable in its current state cannot propagate any traits except for
> collation
>
> Enumerable can propagate in-core distribution trait.
>
> Vladimir
>

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
>Enumerable in its current state cannot propagate any traits except for
collation

Enumerable can propagate in-core distribution trait.

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Ozerov <pp...@gmail.com>.
Hi,

I'd like to stress that I am not trying to argue about subjective
concepts at all. Quite the opposite - I would like to agree or disagree on
a set of objective facts and find the solution. Specifically, from what I
saw in Calcite's codebase and real projects, I assert the following:

   1. Calcite-based projects may use custom traits.
   2. Enumerable in its current state cannot propagate any traits except
   for collation. The relevant code is simply missing from the product, it was
   never implemented.
   3. Despite (2), Enumerable rules/operators may demand unsupported traits
   from inputs, or expose unsupported traits, which may lead to problems on
   the user side (an example is in the first message of this thread).

Do you agree with these points?

If we are in agreement here, then I propose only one thing - fix (3),
because it affects real-life integrations. The fix is trivial:

   - Make sure that Enumerable operators never set non-default trait values
   for anything except for collation. For example, EnumerableProjectRule
   creates an operator with the correct trait set, whilst
   EnumerableFilterRule propagates unsupported traits.
   - Replace RelNode.getTraitSet with RelOptCluster.traitSet when deducing
   the desired input trait set in Enumerable rules.

These two fixes would ensure that we never have any non-default values of
any traits except for collation in Enumerable operators. On the one hand,
it fixes (3). On the other hand, it doesn't break anything, because thanks
to (2) there is nothing to break.
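The difference between the two ways of building a trait set can be sketched with a toy model. Everything below is hypothetical and is not Calcite code: a trait set is modelled as a map from trait name to value, defaults() stands in for what RelOptCluster.traitSet returns, and the string values are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model, not Calcite API: a trait set as a map from trait name to value. */
final class ToyTraitSets {
  private ToyTraitSets() {}

  /** Default trait values, standing in for RelOptCluster.traitSet. */
  static Map<String, String> defaults() {
    Map<String, String> traits = new LinkedHashMap<>();
    traits.put("convention", "NONE");
    traits.put("collation", "[]");
    traits.put("distribution", "ANY");
    return traits;
  }

  /** Starts from the input's traits; anything not replaced leaks through. */
  static Map<String, String> propagateFromInput(Map<String, String> input,
                                                String collation) {
    Map<String, String> traits = new LinkedHashMap<>(input);
    traits.put("convention", "ENUMERABLE");
    traits.put("collation", collation);
    return traits;
  }

  /** Starts from the defaults; only convention and collation are set. */
  static Map<String, String> resetToDefaults(String collation) {
    Map<String, String> traits = defaults();
    traits.put("convention", "ENUMERABLE");
    traits.put("collation", collation);
    return traits;
  }
}
```

Given an input whose distribution is SHARDED($0), propagateFromInput keeps that value on the new operator, whereas resetToDefaults yields ANY regardless of the input. The second form is the contract proposed here: collation is computed, every other trait stays at its default.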

Does it make sense to you?

Regards,
Vladimir.


On Thu, May 6, 2021 at 10:35, Vladimir Sitnikov <si...@gmail.com> wrote:

> Vladimir,
>
> I generally agree with what you are saying,
>
> >Enumerable backend provides a clear and consistent contract: we support
> collation and reset everything
>
> That sounds like a way to go until there's a way to externalize "input
> trait enforcement" rules.
> "output" traits are simpler since they can be computed with metadataquery
> (however, we still hard-code the set of computed traits).
> It might be worth trying to compute all the traits known to the planner.
>
> However, Enumerable could play well with in-core distribution trait as
> well, so there's no need to limit enumerable to "collation only".
>
> If you don't like in-core distribution trait, you just do not use it.
> There's not much sense in limiting enumerable to collation only.
>
> Vladimir
>

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
Vladimir,

I generally agree with what you are saying,

>Enumerable backend provides a clear and consistent contract: we support
collation and reset everything

That sounds like a way to go until there's a way to externalize "input
trait enforcement" rules.
"output" traits are simpler since they can be computed with metadataquery
(however, we still hard-code the set of computed traits).
It might be worth trying to compute all the traits known to the planner.

However, Enumerable could play well with in-core distribution trait as
well, so there's no need to limit enumerable to "collation only".

If you don't like in-core distribution trait, you just do not use it.
There's not much sense in limiting enumerable to collation only.

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Julian Hyde <jh...@gmail.com>.
Vladimir,

You are arguing for pragmatism over idealism. I get that.

The problem with your argument is that you go on to say

> If in the future we invest in the
> proper integration 

That’s a big “If”. Who is the “we” who is going to do this work? Now you are the one being unrealistic.

Calcite is a sophisticated framework that has many high-level abstractions to support scenarios that are not tested in the core code base. We built those abstractions by being idealistic. We couldn’t possibly test them because we didn’t have the use case to exercise them.

How do these abstractions get fully baked into production quality? When the downstream projects that need them refine the features, and contribute fixes back.

It’s not in Calcite’s interests to make it easy for downstream projects to fork the code when they need to do the complex stuff. We need to use our abstractions (in this case, the idea that traits are pluggable) and if those abstractions are wrong or limiting, those downstream projects will come and fix them.

Julian



> On May 5, 2021, at 12:32 PM, Vladimir Ozerov <pp...@gmail.com> wrote:
> 
> Hi Vladimir, Julian,
> 
> I want to distinguish between two cases.
> 
> Some projects may decide to use Calcite's distribution trait. To my
> knowledge, this is not a common pattern because it is not really integrated
> into Calcite. It is not destroyed/adjusted in rules and operators as
> needed, not integrated into EnumerableConvention.enforce, etc.
> 
> Other projects may decide to use a custom distribution trait. Examples are
> Apache Flink, Hazelcast, and some other private projects we work on. There
> are many reasons to do this. A couple of examples:
> 1. Calcite's distribution produces a logical exchange, while production-grade
> optimizers are typically multi-phase and want the distribution
> convention to produce physical exchanges in dedicated physical phase(s).
> 2. Some systems may have custom requirements for distribution, such as
> propagating the number of shards, supporting multiple equivalent keys, etc.
> 
> But in both cases, the bottom line is that Enumerable currently cannot
> work with either built-in or custom distributions because the associated
> code is not implemented in Calcite's core. And even if we add the
> fully-fledged support of the built-in distribution to Enumerable, many
> projects will continue using custom distribution traits because the
> exchange is a physical operation with lots of backend-dependent specific
> quirks, and any attempt to model it abstractly in Calcite's core is
> unlikely to cover some edge cases.
> 
> The same applies to any other custom trait that depends on columns -
> Enumerable will not be able to process it correctly.
> 
> Therefore, instead of having definitely broken code, it might be better
> to apply a defensive approach in which the whole Enumerable backend provides
> a clear and consistent contract: we support collation and reset everything
> else. IMO it is better because it matches the current behavior and would
> never cause strange bugs in user code. If in the future we invest in the
> proper integration of the built-in distribution or figure out how to
> "externalize" the trait propagation for Enumerable operators, we may relax
> this statement.
> 
> Please let me know if it makes any sense.
> 
> Regards,
> Vladimir.
> 
> On Tue, May 4, 2021 at 21:02, Julian Hyde <jh...@apache.org> wrote:
> 
>>> I would say known in-core vs unknown trait is a reasonable approach to
>>> distinguish traits.
>> 
>> Easy, but not reasonable. It will make it very difficult to reuse
>> existing rels and rules (e.g. Enumerable) in a downstream project that
>> has defined its own traits.
>> 
>> On Tue, May 4, 2021 at 10:44 AM Vladimir Sitnikov
>> <si...@gmail.com> wrote:
>>> 
>>>> It seems arbitrary to include Collation but exclude other traits.
>>> 
>>> I would say known in-core vs unknown trait is a reasonable approach to
>>> distinguish traits.
>>> 
>>> Vladimir
>> 


Re: Trait propagation in heterogeneous plans

Posted by Vladimir Ozerov <pp...@gmail.com>.
Hi Vladimir, Julian,

I want to distinguish between two cases.

Some projects may decide to use Calcite's distribution trait. To my
knowledge, this is not a common pattern because it is not really integrated
into Calcite. It is not destroyed/adjusted in rules and operators as
needed, not integrated into EnumerableConvention.enforce, etc.

Other projects may decide to use a custom distribution trait. Examples are
Apache Flink, Hazelcast, and some other private projects we work on. There
are many reasons to do this. A couple of examples:
1. Calcite's distribution produces a logical exchange, while production-grade
optimizers are typically multi-phase and want the distribution
convention to produce physical exchanges in dedicated physical phase(s).
2. Some systems may have custom requirements for distribution, such as
propagating the number of shards, supporting multiple equivalent keys, etc.

But in both cases, the bottom line is that Enumerable currently cannot
work with either built-in or custom distributions because the associated
code is not implemented in Calcite's core. And even if we add the
fully-fledged support of the built-in distribution to Enumerable, many
projects will continue using custom distribution traits because the
exchange is a physical operation with lots of backend-dependent specific
quirks, and any attempt to model it abstractly in Calcite's core is
unlikely to cover some edge cases.

The same applies to any other custom trait that depends on columns -
Enumerable will not be able to process it correctly.

Therefore, instead of having definitely broken code, it might be better
to apply a defensive approach in which the whole Enumerable backend provides
a clear and consistent contract: we support collation and reset everything
else. IMO it is better because it matches the current behavior and would
never cause strange bugs in user code. If in the future we invest in the
proper integration of the built-in distribution or figure out how to
"externalize" the trait propagation for Enumerable operators, we may relax
this statement.

Please let me know if it makes any sense.

Regards,
Vladimir.

On Tue, May 4, 2021 at 21:02, Julian Hyde <jh...@apache.org> wrote:

> > I would say known in-core vs unknown trait is a reasonable approach to
> > distinguish traits.
>
> Easy, but not reasonable. It will make it very difficult to reuse
> existing rels and rules (e.g. Enumerable) in a downstream project that
> has defined its own traits.
>
> On Tue, May 4, 2021 at 10:44 AM Vladimir Sitnikov
> <si...@gmail.com> wrote:
> >
> > > It seems arbitrary to include Collation but exclude other traits.
> >
> > I would say known in-core vs unknown trait is a reasonable approach to
> > distinguish traits.
> >
> > Vladimir
>

Re: Trait propagation in heterogeneous plans

Posted by Julian Hyde <jh...@apache.org>.
> I would say known in-core vs unknown trait is a reasonable approach to
> distinguish traits.

Easy, but not reasonable. It will make it very difficult to reuse
existing rels and rules (e.g. Enumerable) in a downstream project that
has defined its own traits.

On Tue, May 4, 2021 at 10:44 AM Vladimir Sitnikov
<si...@gmail.com> wrote:
>
> > It seems arbitrary to include Collation but exclude other traits.
>
> I would say known in-core vs unknown trait is a reasonable approach to
> distinguish traits.
>
> Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
> It seems arbitrary to include Collation but exclude other traits.

I would say known in-core vs unknown trait is a reasonable approach to
distinguish traits.

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Julian Hyde <jh...@gmail.com>.
It seems arbitrary to include Collation but exclude other traits. Convention is, and should remain, the only “special” trait.

Distribution does apply to Enumerable operators. Other traits, including those defined by the user without modifying core Calcite, should also be supported.

I acknowledge that it’s not easy to make an API to make that happen. The “RelTraitSet.replaceIf(RelTraitDef, Supplier<RelTrait>)” method was my attempt to move in that direction.
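
One way to read the intent behind replaceIf is: compute a trait only when the trait set actually tracks it, so that an operator can offer a value for an optional trait without forcing it on planners that never registered that trait. The sketch below is a toy model under that assumption. ToyReplaceIf and its map-based trait set are hypothetical and do not reproduce Calcite's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model, not Calcite API: conditional replacement of a trait value. */
final class ToyReplaceIf {
  private ToyReplaceIf() {}

  /**
   * Returns a copy of the trait set with the given trait replaced, but only
   * if the trait is present (enabled) in the set. Otherwise the original set
   * is returned and the supplier is never invoked.
   */
  static Map<String, String> replaceIf(Map<String, String> traits,
                                       String traitDef,
                                       Supplier<String> valueSupplier) {
    if (!traits.containsKey(traitDef)) {
      return traits; // trait not tracked: skip the computation entirely
    }
    Map<String, String> copy = new LinkedHashMap<>(traits);
    copy.put(traitDef, valueSupplier.get());
    return copy;
  }
}
```

With a trait set that tracks only convention and collation, a call for "distribution" leaves the set untouched and never runs the supplier, while a call for "collation" replaces the value. Passing a supplier rather than a value keeps a potentially expensive derivation from running for traits nobody asked for.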

It’s problematic to test this stuff, because most of the real examples in Calcite core only use the Convention and Collation traits.

Julian




> On May 4, 2021, at 3:00 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Vladimir Sitnikov>On the other hand, Project might indeed affect
> distribution,
> Vladimir Sitnikov>so EnumerableProject could keep distribution trait just
> fine.
> 
> Correction: "EnumerableFilter could keep distribution while
> EnumerableProject should destroy or compute it"
> 
> Vladimir


Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
Vladimir Sitnikov>On the other hand, Project might indeed affect
distribution,
Vladimir Sitnikov>so EnumerableProject could keep distribution trait just
fine.

Correction: "EnumerableFilter could keep distribution while
EnumerableProject should destroy or compute it"

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
>Examples of the problem: EnumerableWindowRule,
>   EnumerableFilterRule

That looks like a bug, and we should probably fix it (ensure #create
methods are used that compute traits as needed)

However, it looks like the common pattern is to compute traits from
metadata query, so
the question might be "why do we hard-code a couple of
collation+distribution traits? What if we re-compute all the traits?".

>But given that the distribution is not supported by Enumerable

In practice, EnumerableFilter won't affect distribution, so it could be OK
to keep distribution trait for EnumerableFilter.
On the other hand, Project might indeed affect distribution, so
EnumerableProject could keep distribution trait just fine.

I'm not sure "enumerable does not support distribution" is the right way to
put things.

Distribution is an in-core trait, so I believe EnumerableProject and
EnumerableFilter should support it properly.

However, it might indeed be sane to reset all the unknown traits to the
unknown state for "input enforcement".
I am not sure if we could/should make input enforcement customizable with
something like metadataquery though.

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Ozerov <pp...@gmail.com>.
Hi Vladimir,

I can't share the reproducer, as it is under NDA. But the problem
is evident from the code.

There are two distinct issues actually:

   1. Propagation of unsupported traits in operators. EnumerableProject is
   not affected. Examples of the problem: EnumerableWindowRule,
   EnumerableFilterRule
   2. Incorrect enforcement of the input traits. Example:
   EnumerableProjectRule.convert. Imagine that I have an input with some
   custom trait, say, distribution. The EnumerableProjectRule may require
   the input to satisfy some specific distribution. But given that the
   distribution is not supported by Enumerable, I want to destroy the
   distribution in my convention enforcer. If I do so, I get the
   CannotPlanException, because the created EnumerableProject incorrectly
   requires the specific distribution from the input.

Regards,
Vladimir.

On Tue, May 4, 2021 at 11:06, Vladimir Sitnikov <si...@gmail.com> wrote:

> >First, the EnumerableProjectRule is executed. This rule propagates traits
> >from the input, replacing only convention.
>
> Vladimir, could you please share a reproducer?
>
> EnumerableProject#create explicitly resets all the traits for
> EnumerableProject except convention=enumerable, and
> collation=computed_with_metadataquery
> In practice, it could compute distribution traits as well, however, that is
> missing.
>
> Are you sure you get EnumerableProject with non-default distribution
> somehow?
>
> Vladimir
>

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Sitnikov <si...@gmail.com>.
>First, the EnumerableProjectRule is executed. This rule propagates traits
>from the input, replacing only convention.

Vladimir, could you please share a reproducer?

EnumerableProject#create explicitly resets all the traits for
EnumerableProject except convention=enumerable, and
collation=computed_with_metadataquery
In practice, it could compute distribution traits as well, however, that is
missing.

Are you sure you get EnumerableProject with non-default distribution
somehow?

Vladimir

Re: Trait propagation in heterogeneous plans

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Vladimir,

I find it completely reasonable to propagate only supported traits and not
all kinds of them.

Per the classic Volcano paper [1], "enforcers" (I suppose "implementation
algorithms" as well) can ensure multiple physical properties and destroy
some others, which is also in line with what you propose.
In terms of implementation, I don't know what this change might break, but
it seems the correct thing to do.
If I remember correctly, trait propagation is under a boolean flag, so just make
sure that the change also makes sense when the property is disabled.

Best,
Stamatis

[1]
https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf

On Tue, May 4, 2021 at 9:37 AM Vladimir Ozerov <pp...@gmail.com> wrote:

> Hi Haisheng,
>
> My original problem was with how Enumerable propagates traits. Many
> Enumerable rules copy traits from the child operator. This seems wrong
> because, as you mentioned, Enumerable supports only collation. Propagation
> of the unsupported traits may lead to CannotPlanException as in the example
> above when having a plan with multiple conventions.
>
> Therefore, the proposal is to change Enumerable rules, so that they
> propagate only collation, but not other traits. Does it make sense?
>
> Regards,
> Vladimir.
>

Re: Trait propagation in heterogeneous plans

Posted by Vladimir Ozerov <pp...@gmail.com>.
Hi Haisheng,

My original problem was with how Enumerable propagates traits. Many
Enumerable rules copy traits from the child operator. This seems wrong
because, as you mentioned, Enumerable supports only collation. Propagation
of the unsupported traits may lead to CannotPlanException as in the example
above when having a plan with multiple conventions.

Therefore, the proposal is to change Enumerable rules, so that they
propagate only collation, but not other traits. Does it make sense?

Regards,
Vladimir.
