Posted to dev@calcite.apache.org by Haisheng Yuan <h....@alibaba-inc.com> on 2019/04/07 04:38:29 UTC

[DISCUSS] RelCompositeTrait

Hi,

I found there are some RelCompositeTrait related issues:
https://issues.apache.org/jira/browse/CALCITE-2010
https://issues.apache.org/jira/browse/CALCITE-2593
https://issues.apache.org/jira/browse/CALCITE-2764

Multi-sorted tables are rare in practice, and multi-distributed tables don't exist either. A Values node with a few tuples is not worth optimizing, and one with many tuples is not worth optimizing either, because the time the optimizer takes to figure out the ordering may be longer than simply sorting at runtime.
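
For a rough sense of scale, here is a minimal Java sketch that counts candidate orderings. It assumes that every permutation of all n columns counts as one candidate collation, which is a simplification: it ignores sort direction and partial keys. The class and method names are made up for illustration and are not Calcite APIs.

```java
import java.math.BigInteger;

public class OrderingCount {
    /** Number of full column orderings (permutations) of n columns, i.e. n!. */
    static BigInteger orderings(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // 7 columns: 7! = 5040 candidate orderings
        System.out.println(orderings(7));
        // 100 columns: 100! is a 158-digit number
        System.out.println(orderings(100).toString().length());
    }
}
```

Even at 7 columns the count is already in the thousands, so listing every ordering up front quickly costs more than a single sort of a small input.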

In issue https://issues.apache.org/jira/browse/CALCITE-1990,
Leo extended RelDistribution to inherit from RelMultipleTrait, just like RelCollation does, to solve his problem in the example. But I don't think this is an appropriate way to represent the equivalence classes (in PostgreSQL's terms).

So why did we introduce RelCompositeTrait and RelMultipleTrait in the beginning? Seems like it gives us more pain than gain.

Thanks ~
Haisheng Yuan

Re: [DISCUSS] RelCompositeTrait

Posted by Stamatis Zampetakis <za...@gmail.com>.
I will try to find some time the following week to look into the
problem/proposal in CALCITE-2593.

I don't like stalling things but if possible let's wait a bit more. There
are various parts in the code indicating that the composite traits
should not be part of a RelSubset. I was thinking that we should try to
maintain this invariant if possible.


Re: [DISCUSS] RelCompositeTrait

Posted by Hongze Zhang <no...@126.com>.
You are right: removing the collations would only work around the case that led us to find the issue with multi-sort. Maybe we'd better not remove them (at least for now), since they provide an easy way to test composite traits.

Hongze

> On Apr 17, 2019, at 01:15, Haisheng Yuan <h....@alibaba-inc.com> wrote:
> 
>> it looks like if we want to get these problems fixed quickly we can just remove
>> EnumerableValues's collation emitting.
> 
> I am afraid that even removing the Values collation enumeration won't actually be a quick fix,
> because a multi-sorted table, if there is one, might still hit the same issue as Values.
> 
> 
> Thanks ~
> Haisheng Yuan
> ------------------------------------------------------------------
> From: Hongze Zhang<no...@126.com>
> Date: 2019-04-16 18:10:17
> To: <de...@calcite.apache.org>
> Subject: Re: [DISCUSS] RelCompositeTrait
> 
> If we limit the scope to Calcite itself, I think the 3 JIRA
> tickets that Haisheng listed (CALCITE-2010, CALCITE-2593, CALCITE-2764;
> thanks, Haisheng!) are all more or less related to the multi-sorted
> EnumerableValues. And it looks like if we want to get these
> problems fixed quickly we can just remove EnumerableValues's collation
> emitting. I recall that (correct me if I am wrong) the rel is not even
> able to emit descending collations, so I suppose it was not perfect in
> the first place.
> 
> Another discussion is about enumerating traits. IMHO it's hard to
> say that Calcite has not already tried to avoid enumerating them. The
> methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
> already test the relationship between traits without
> checking equality. So it looks like we already tried
> not to enumerate them but failed in the end.
> 
> Regarding the composite traits, one embarrassing thing I can see so far
> is about the method RelTraitSet#simplify[3]. The JavaDoc says the method
> is to "return a trait set similar to this one but with all composite
> traits flattened". But when we look into the related implementations
> RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
> they do not seem to flatten anything, the traits just simply get wiped.
> This makes me wonder whether it is really correct that
> RelCollation/RelDistribution extend RelMultipleTrait, because we can't
> leverage the trait simplification but are actually hurt by it. If a rel
> loses its physical property, we can never prevent the planner from adding
> unnecessary sorts/exchanges.
> 
> Besides, even if we decide to add some extra sorts/exchanges, that is
> not easy so far. See CALCITE-2592/CALCITE-2970: the planner does
> not add them automatically all that smoothly.
> 
> Overall, regarding these "small" problems, I think none of them is
> really impossible to solve (though coming up with the right solutions may not
> be that straightforward). But of course, if in the future a brand new design can
> be proposed to improve the entire trait system (such as avoiding
> enumerating traits), I think that would be a great thing.
> 
> Best,
> Hongze
> 
> 
> [1]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
> [2]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
> [3]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
> [4]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
> [5]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49
> 
> ------ Original Message ------
> From: "Haisheng Yuan" <h....@alibaba-inc.com>
> To: "Jacques Nadeau" <ja...@apache.org>; "Apache Calcite dev list" 
> <de...@calcite.apache.org>
> Sent: 2019/4/15 12:27:04
> Subject: Re: Re: [DISCUSS] RelCompositeTrait
> 
>>> There are major challenges with asking for particular traits as well.
>> Imagine a desired aggregate on 7 columns. What does the requestor request
>> with regards to distribution? All seven columns? One column? Some
>> combination in between?
>> 
>> The same challenges exist for enumerating all the traits as well. Imagine
>> there is an order by on the 7 grouping keys on top of the aggregate on 7 columns,
>> but with different sort directions:
>> select * from foo group by a,b,c... order by c desc, a asc, b desc...
>> What sort order and direction should the sort-based stream aggregate provide?
>> All ascending, all descending, order (a,b,c...), order(..c,b,a), or all the combinations?
>> All of those enumerated traits are useless except one; for the others, an additional
>> sort operator will be needed.
>> 
>> Another example is an aggregate on top of a join, where the join is on 7 keys and the aggregate
>> is on 2 of the join keys. In a distributed system, what distribution trait would the join
>> operator provide? The 2 grouping keys? All the join keys? All the combinations?
>>
>> Enumerating some or all of the deliverable traits is not purpose-driven. All the traits
>> may be just useless for the parent operator. On the other hand, asking the child
>> operator for particular traits is purpose-driven: at least the traits asked for by the parent
>> operator are worth consideration, so it is not as wasteful as the former.
>> 
>> If I understand RelCompositeTrait's intent correctly, the enumerated traits, whether
>> some combinations or all combinations, should be saved here. But in fact,
>> it seems they are not. And as Jacques mentioned, many people rely on RelMetadata
>> operations to pull up the traitsets through operators.
>>
>> This makes me wonder if there are any true use cases or systems
>> that rely on RelCompositeTrait. If someone has the story, we would love to hear it.
>>
>> Putting that aside, even if RelCompositeTrait is indispensable, why do we bother optimizing
>> the Values node? For Values with several tuples, it is not worth optimizing; with
>> many tuples, it may take more time to enumerate the RelCollations than to just sort it.
>> Specifically for Values with 0 or 1 tuple, but with many columns, it is definitely not
>> worth the optimization, because the sort removal rule and the empty rel removal rule should
>> do the work.
>> 
>> 
>> Thanks ~
>> Haisheng Yuan
>> ------------------------------------------------------------------
>> From: Jacques Nadeau<ja...@apache.org>
>> Date: 2019-04-15 07:36:51
>> To: <de...@calcite.apache.org>
>> Subject: Re: [DISCUSS] RelCompositeTrait
>> 
>> There are major challenges with asking for particular traits as well.
>> Imagine a desired aggregate on 7 columns. What does the requestor request
>> with regards to distribution? All seven columns? One column? Some
>> combination in between? The trait system in Calcite is very challenging to
>> work with because it is up to downstream users to try to figure out trait
>> propagation outside the core. So challenging, that I believe that many
>> people move to relying on RelMetadata operations since those can be pulled
>> across several operators at once.
>> 
>> It would be great if someone could spend the time to come up with a more
>> global design for these items and we avoid solving one-off problems.
>> Rationalizing when something should be trait, how to avoid trait planning
>> cost explosion, how to propagate, when something should be handled via
>> RelMetadataQuery, when something should be managed via traits versus
>> materialized view alternatives, etc.
>> 
>> An example of overlapping functionality I'd start with is: should
>> multitraits for collation really exist or would exposing these as
>> materialized view alternatives be more appropriate? Why is it necessary to
>> have a 'shortcut' for this situation while other alternatives don't have
>> one?
>> 
>> 
>> 
>> On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <jh...@apache.org> wrote:
>> 
>>> It seemed reasonable when I introduced it, and seems very reasonable, that
>>> a relational expression (even in the relational model) can have multiple
>>> physical properties. Consider these questions that the planner might ask:
>>> 
>>> Example 1:
>>> 
>>> “Are you sorted on hiredate?”
>>> “Yes”
>>> “Are you sorted on empno?”
>>> “Yes”
>>> “Are you sorted on deptno?”
>>> “No”
>>> 
>>> Example 2:
>>> 
>>> “Can you fit into less than 100MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 10MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 1MB of memory?”
>>> “No”
>>> 
>>> We manage traits like those in example 1 using RelCompositeTrait. We can’t
>>> handle traits like this in example 2, and so we have trained ourselves to
>>> not think of “can fit into memory X” as a trait at all.
>>> 
>>> Perhaps our mistake is to have an API “tell me all of your traits” rather
>>> than an API “do you have trait X?”. Asking a RelNode to enumerate its
>>> traits can be painful: the extreme case is an empty Values with 100
>>> columns; it satisfies any sort order, and there are 100! of these.
>>> 
>>> Julian
>>> 
>>> 
>>> 
>>>> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <za...@gmail.com> wrote:
>>>>
>>>> Hi Haisheng,
>>>>
>>>> Thanks for raising awareness around this topic. I also think we should try
>>>> to find a solution.
>>>>
>>>> Initially, the Volcano planner was designed to be able to cover multiple
>>>> models (and not only the relational). For non-relational models composite
>>>> traits may be indispensable. I don't know if there are people on this list
>>>> that are using the planner for other models, but if there are it would be
>>>> nice to hear from them.
>>>>
>>>> Focusing exclusively on the relational model, I think composite traits are
>>>> useful. One use-case that comes to my mind is data replication. It
>>>> makes perfect sense to partition (distribute) your table on two (or more)
>>>> columns to be able to execute queries efficiently using special partition
>>>> joins. A concrete use-case is RDF data, where many distributed systems store
>>>> the triples table partitioned by subject and object. I guess such use-cases
>>>> could possibly be modelled in other ways, but composite traits are what comes
>>>> naturally to my mind.
>>>>
>>>> Regarding multi-sorted tables, they are not that rare, for example if you import
>>>> sorted data into a table with an auto-increment primary key.
>>>>
>>>> I think all the trait-related issues can be solved if we prioritize them
>>>> correctly. Apart from Vladimir and Hongze, who have already spent quite some
>>>> time on these, the rest of us should also jump in and try to help.
>>>> 
>>>> Best,
>>>> Stamatis
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h....@alibaba-inc.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I found there are some RelCompositeTrait related issues:
>>>>> https://issues.apache.org/jira/browse/CALCITE-2010
>>>>> https://issues.apache.org/jira/browse/CALCITE-2593
>>>>> https://issues.apache.org/jira/browse/CALCITE-2764
>>>>>
>>>>> Multi-sorted tables are rare in practice, and multi-distributed tables don't
>>>>> exist either. A Values node with a few tuples is not worth optimizing,
>>>>> and one with many tuples is not worth optimizing either, because the time
>>>>> the optimizer takes to figure out the ordering may be longer than simply
>>>>> sorting at runtime.
>>>>>
>>>>> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
>>>>> Leo extended RelDistribution to inherit from RelMultipleTrait, just like
>>>>> RelCollation does, to solve his problem in the example. But I don't think
>>>>> this is an appropriate way to represent the equivalence classes (in
>>>>> PostgreSQL's terms).
>>>>>
>>>>> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
>>>>> beginning? Seems like it gives us more pain than gain.
>>>>> 
>>>>> Thanks ~
>>>>> Haisheng Yuan
>>>>> 
>>> 
>>> 
>> 


Re: Re: [DISCUSS] RelCompositeTrait

Posted by Vladimir Sitnikov <si...@gmail.com>.
> Regarding the composite traits, one embarrassing thing I can see so far
> is about the method RelTraitSet#simplify[3].
>the traits just simply get wiped.

Exactly.
Of course no one knows how it should work, however making
"RelTraitSet#simplify" a no-op does heal certain cases,
and it looks like it does not make things worse.

I proposed that solution in
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16750377
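
To make the difference concrete, here is a toy Java model of the two behaviours discussed in this thread. It is only a sketch under loud assumptions: a composite collation trait is modelled as a plain list of collations, each collation as a list of column names, and "wiping" as replacing a composite with an empty default. None of the names below are Calcite APIs, and the code does not reproduce the actual RelTraitSet#simplify implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SimplifySketch {
    /** Hypothetical "wiping" simplify: a composite (more than one collation) collapses to the empty default. */
    static List<List<String>> simplifyByWiping(List<List<String>> trait) {
        return trait.size() > 1 ? Collections.<List<String>>emptyList() : trait;
    }

    /** Hypothetical no-op simplify: the composite is kept as it is. */
    static List<List<String>> simplifyNoOp(List<List<String>> trait) {
        return trait;
    }

    /** True if at least one collation in the trait starts with the required sort keys. */
    static boolean satisfies(List<List<String>> trait, List<String> required) {
        for (List<String> collation : trait) {
            if (collation.size() >= required.size()
                    && collation.subList(0, required.size()).equals(required)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A rel sorted on hiredate and, independently, on empno.
        List<List<String>> composite =
            Arrays.asList(Arrays.asList("hiredate"), Arrays.asList("empno"));
        List<String> required = Arrays.asList("empno");

        // After wiping, the ordering on empno is lost, so a sort would be added.
        System.out.println(satisfies(simplifyByWiping(composite), required));
        // With the no-op variant the ordering survives and no sort is needed.
        System.out.println(satisfies(simplifyNoOp(composite), required));
    }
}
```

In this model the wiping variant forgets an ordering the rel really has, which is the kind of lost physical property that leads to unnecessary sorts.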

Julian offered to help find a solution on 23 Jan (see CALCITE-2593).
Apparently Julian is not active with regard to CALCITE-2593 / composites, so I'm
inclined to commit the fix.

Vladimir

Re: Re: [DISCUSS] RelCompositeTrait

Posted by Haisheng Yuan <h....@alibaba-inc.com>.
> it looks like if we want to get these problems fixed quickly we can just remove
> EnumerableValues's collation emitting.

I am afraid that even removing the Values collation enumeration won't actually be a quick fix,
because a multi-sorted table, if there is one, might still hit the same issue as Values.


Thanks ~
Haisheng Yuan
------------------------------------------------------------------
From: Hongze Zhang<no...@126.com>
Date: 2019-04-16 18:10:17
To: <de...@calcite.apache.org>
Subject: Re: [DISCUSS] RelCompositeTrait

If we limit the scope to Calcite itself, I think the 3 JIRA
tickets that Haisheng listed (CALCITE-2010, CALCITE-2593, CALCITE-2764;
thanks, Haisheng!) are all more or less related to the multi-sorted
EnumerableValues. And it looks like if we want to get these
problems fixed quickly we can just remove EnumerableValues's collation
emitting. I recall that (correct me if I am wrong) the rel is not even
able to emit descending collations, so I suppose it was not perfect in
the first place.

Another discussion is about enumerating traits. IMHO it's hard to
say that Calcite has not already tried to avoid enumerating them. The
methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
already test the relationship between traits without
checking equality. So it looks like we already tried
not to enumerate them but failed in the end.
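
The idea of testing a relationship between traits instead of equality can be sketched roughly as follows. This is a toy model, not Calcite's code: a collation is assumed to be just an ordered list of column names, and a provided collation is assumed to satisfy a required one when the required keys form a prefix of it. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CollationSatisfies {
    /**
     * Prefix test: data sorted on (a, b, c) is also sorted on (a) and (a, b),
     * so the provided collation satisfies any required collation that is a prefix of it.
     */
    static boolean satisfies(List<String> provided, List<String> required) {
        return provided.size() >= required.size()
            && provided.subList(0, required.size()).equals(required);
    }

    public static void main(String[] args) {
        List<String> provided = Arrays.asList("a", "b", "c");
        System.out.println(satisfies(provided, Arrays.asList("a", "b"))); // true: prefix
        System.out.println(satisfies(provided, Arrays.asList("b")));      // false: not a prefix
    }
}
```

With a test like this, a rel only has to advertise its longest collation, and the shorter prefixes never need to be enumerated as separate traits.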

Regarding the composite traits, one embarrassing thing I can see so far
is about the method RelTraitSet#simplify[3]. The JavaDoc says the method
is to "return a trait set similar to this one but with all composite
traits flattened". But when we look into the related implementations
RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
they do not seem to flatten anything, the traits just simply get wiped.
This makes me wonder whether it is really correct that
RelCollation/RelDistribution extend RelMultipleTrait, because we can't
leverage the trait simplification but are actually hurt by it. If a rel
loses its physical property, we can never prevent the planner from adding
unnecessary sorts/exchanges.

Besides, even if we decide to add some extra sorts/exchanges, that is
not easy so far. See CALCITE-2592/CALCITE-2970: the planner does
not add them automatically all that smoothly.

Overall, regarding these "small" problems, I think none of them is
really impossible to solve (though coming up with the right solutions may not
be that straightforward). But of course, if in the future a brand new design can
be proposed to improve the entire trait system (such as avoiding
enumerating traits), I think that would be a great thing.

Best,
Hongze


[1]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
[2]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
[3]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
[4]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
[5]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49

------ Original Message ------
From: "Haisheng Yuan" <h....@alibaba-inc.com>
To: "Jacques Nadeau" <ja...@apache.org>; "Apache Calcite dev list" 
<de...@calcite.apache.org>
Sent: 2019/4/15 12:27:04
Subject: Re: Re: [DISCUSS] RelCompositeTrait

>> There are major challenges with asking for particular traits as well.
>Imagine a desired aggregate on 7 columns. What does the requestor request
>with regards to distribution? All seven columns? One column? Some
>combination in between?
>
>The same challenges exist for enumerating all the traits as well. Imagine
>there is an order by on the 7 grouping keys on top of the aggregate on 7 columns,
>but with different sort directions:
>select * from foo group by a,b,c... order by c desc, a asc, b desc...
>What sort order and direction should the sort-based stream aggregate provide?
>All ascending, all descending, order (a,b,c...), order(..c,b,a), or all the combinations?
>All of those enumerated traits are useless except one; for the others, an additional
>sort operator will be needed.
>
>Another example is an aggregate on top of a join, where the join is on 7 keys and the aggregate
>is on 2 of the join keys. In a distributed system, what distribution trait would the join
>operator provide? The 2 grouping keys? All the join keys? All the combinations?
>
>Enumerating some or all of the deliverable traits is not purpose-driven. All the traits
>may be just useless for the parent operator. On the other hand, asking the child
>operator for particular traits is purpose-driven: at least the traits asked for by the parent
>operator are worth consideration, so it is not as wasteful as the former.
>
>If I understand RelCompositeTrait's intent correctly, the enumerated traits, no
>matter some combination or all combination, should be saved here. But in fact,
>it seems not. And as Jacques mentioned, many people rely on RelMetadata
>operations to pull up the traitsets through operators.
>
>This makes me curious and wonder if there are any true use cases or systems
>who rely on RelCompositeTrait. If someone has the story, we would love to hear.
>
>Put that aside, even RelCompositeTrait is indispensible, why do we bother optimizing
>Values node? For values with several tuples, it is not worth optimization, with
>many tuples, it may take more time to enumerate the RelCollation than just sorting it.
>Specifically for Values with 0 or 1 tuple, but with many columns, it is definitely not
>worth the optimization, because sort removal rule and empty rel removal rule should
>do the work.
>
>
>Thanks ~
>Haisheng Yuan
>------------------------------------------------------------------
>发件人:Jacques Nadeau<ja...@apache.org>
>日 期:2019年04月15日 07:36:51
>收件人:<de...@calcite.apache.org>
>主 题:Re: [DISCUSS] RelCompositeTrait
>
>There are major challenges with asking for particular traits as well.
>Imagine a desired aggregate on 7 columns. What does the requestor request
>with regards to distribution? All seven columns? One column? Some
>combination in between? The trait system in Calcite is very challenging to
>work with because it is up to downstream users to try to figure out trait
>propagation outside the core. So challenging, that I believe that many
>people move to relying on RelMetadata operations since those can be pulled
>across several operators at once.
>
>It would be great if someone could spend the time to come up with a more
>global design for these items and we avoid solving one-off problems.
>Rationalizing when something should be trait, how to avoid trait planning
>cost explosion, how to propagate, when something should be handled via
>RelMetadataQuery, when something should be managed via traits versus
>materialized view alternatives, etc.
>
>An example of overlapping functionality I'd start with is: should
>multitraits for collation really exist or would exposing these as
>materialized view alternatives be more appropriate? Why is it necessary to
>have a 'shortcut' for this situation while other alternatives don't have
>one?
>
>
>
>On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <jh...@apache.org> wrote:
>
>> It seemed reasonable when I introduced it, and seems very reasonable, that
>> a relational expression (even in the relational model) can have multiple
>> physical properties. Consider these questions that the planner might ask:
>>
>> Example 1:
>>
>> “Are you sorted on hiredate?”
>> “Yes”
>> “Are you sorted on empno?”
>> “Yes”
>> “Are you sorted on deptno?”
>> “No”
>>
>> Example 2:
>>
>> “Can you fit into less than 100MB of memory?”
>> “Yes”
>> “Can you fit into less than 10MB of memory?”
>> “Yes”
>> “Can you fit into less than 1MB of memory?”
>> “No”
>>
>> We manage traits like those in example 1 using RelCompositeTrait. We can’t
>> handle traits like this in example 2, and so we have trained ourselves to
>> not think of “can fit into memory X” as a trait at all.
>>
>> Perhaps our mistake is to have an API “tell me all of your traits” rather
>> than an API “do you have trait X?”. Asking a RelNode to enumerate its
>> traits can be painful: the extreme case is an empty Values with 100
>> columns; it satisfies any sort order, and there are 100! of these.
>>
>> Julian
>>
>>
>>
>> > On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <za...@gmail.com>
>> wrote:
>> >
>> > Hi Haisheng,
>> >
>> > Thanks for raising awareness around this topic. I also think we should
>> try
>> > to find a solution.
>> >
>> > Initially, the Volcano planner was designed to be able to cover multiple
>> > models (and not only the relational). For non-relational models composite
>> > traits may be indispensable. I don't know if there are people in this
>> list
>> > that are using the planner for other models but if there are it would be
>> > nice to hear from them.
>> >
>> > Focusing exclusively on the relational model, I think composite traits
>> are
>> > useful. One use-case that comes to my mind is data replication. It
>> > perfectly makes sense to partition (distribute) your table on two (or
>> more)
>> > columns to be able execute efficiently queries using special partition
>> > joins. A concrete use-case is RDF data where many distributed systems
>> store
>> > the triples table partitioned by subject and object. I guess such
>> use-cases
>> > could possibly be modelled in other ways but composite traits is what
>> comes
>> > naturally to my mind.
>> >
>> > Regarding multi-sorted tables it is not that rare if you import sorted
>> data
>> > into a table with an auto-increment primary key for example.
>> >
>> > I think all the trait-related issues can be solved if we prioritize them
>> > correctly. Apart from Vladimir and Hongze, who already spend quite some
>> > time on these, the rest of us should also jump in and try to help.
>> >
>> > Best,
>> > Stamatis
>> >
>> >
>> >
>> >
>> > On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h....@alibaba-inc.com>
>> wrote:
>> >
>> >> Hi,
>> >>
>> >> I found there are some RelCompositeTrait related issues:
>> >> https://issues.apache.org/jira/browse/CALCITE-2010
>> >> https://issues.apache.org/jira/browse/CALCITE-2593
>> >> https://issues.apache.org/jira/browse/CALCITE-2764
>> >>
>> >> Multi-sorted table are rare in pratice, mutil-distributed table doesn't
>> >> exist either. Values node with several tuples is not worth optimization,
>> >> with many tuples is not worth optimization either, because the time it
>> >> takes optimizer to figure out the ordering may be longer than just sort
>> it
>> >> in runtime.
>> >>
>> >> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
>> >> Leo extended RelDistribution to inherit RelMultipleTrait, just like
>> >> RelCollation does, to solve his problem in the example. But I don't
>> think
>> >> this is an appropriate way to represent the equivalence classes (in
>> >> PostgreSQL's term).
>> >>
>> >> So why did we introduce RelCompisteTrait and RelMultipleTrait in the
>> >> beginning? Seems like it gives us more pain than gain.
>> >>
>> >> Thanks ~
>> >> Haisheng Yuan
>> >>
>>
>>
>

Re: [DISCUSS] RelCompositeTrait

Posted by Hongze Zhang <no...@126.com>.
If we narrow the scope to Calcite itself, I think the three JIRA
tickets that Haisheng listed (CALCITE-2010, CALCITE-2593, CALCITE-2764;
thanks, Haisheng!) are all more or less related to the multi-sorted
EnumerableValues. It looks like, if we want these problems fixed
quickly, we can just stop EnumerableValues from emitting collations.
I recall (correct me if I am wrong) that the rel cannot even emit
descending collations, so I suppose it was not perfect in the first
place.

The other discussion is about enumerating traits. IMHO it's hard to
say that Calcite has not already tried to avoid enumerating them. The
methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
already test the relationship between traits without checking
equality. So the whole thing looks like we tried not to enumerate
them but failed in the end.
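
To illustrate what "testing the relationship between traits without
checking equality" can mean for collations, here is a minimal sketch.
It is not Calcite's code: the class and method names are hypothetical,
sort keys are plain column names, and sort direction is left out. It
assumes the usual rule that rows sorted on (a, b, c) are also sorted
on any leading prefix such as (a, b).

```java
import java.util.List;

/** Hypothetical sketch; these are not Calcite's classes. */
final class CollationPrefixSketch {
  /**
   * Returns whether rows sorted on {@code provided} are necessarily sorted on
   * {@code required}: true exactly when {@code required} is a prefix of
   * {@code provided}. Sort keys are plain column names; direction is omitted.
   */
  static boolean satisfies(List<String> provided, List<String> required) {
    return required.size() <= provided.size()
        && provided.subList(0, required.size()).equals(required);
  }

  public static void main(String[] args) {
    List<String> provided = List.of("a", "b", "c");
    System.out.println(satisfies(provided, List.of("a", "b"))); // leading keys
    System.out.println(satisfies(provided, List.of("b")));      // not a leading key
    System.out.println(satisfies(provided, List.of()));         // no requirement at all
  }
}
```

The point of the sketch is that the provider never lists (a), (a, b)
and (a, b, c) as separate traits; one stored collation answers all
three questions.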

Regarding the composite traits, one awkward thing I can see so far
is the method RelTraitSet#simplify[3]. The Javadoc says the method
is to "return a trait set similar to this one but with all composite
traits flattened". But when we look into the related implementations
RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
they do not seem to flatten anything; the traits simply get wiped.
This makes me wonder whether it is really correct for
RelCollation/RelDistribution to extend RelMultipleTrait, because we
cannot leverage the trait simplification but are actually hurt by it.
If a rel loses its physical property, we can never prevent
unnecessary sorts/exchanges from being added.

Besides, even if we decide to add some extra sorts/exchanges, that is
not easy so far. See CALCITE-2592/CALCITE-2970: the planner does not
yet add them automatically in a smooth way.

Overall, regarding these "small" problems, I think none of them is
really impossible to solve (though coming up with the right solutions
may not be that straightforward). But of course, if in the future a
brand new design is proposed to improve the entire trait system (such
as avoiding enumerating traits), I think that would be a great thing.

Best,
Hongze


[1] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
[2] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
[3] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
[4] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
[5] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49

------ Original Message ------
From: "Haisheng Yuan" <h....@alibaba-inc.com>
To: "Jacques Nadeau" <ja...@apache.org>; "Apache Calcite dev list" 
<de...@calcite.apache.org>
Sent: 2019/4/15 12:27:04
Subject: Re: Re: [DISCUSS] RelCompositeTrait

>>  There are major challenges with asking for particular traits as well.
>Imagine a desired aggregate on 7 columns. What does the requestor request
>with regards to distribution? All seven columns? One column? Some
>combination in between?
>
>The same challenges exist for enumerating all the traits as well. Imagine
>there is an order by on the 7 grouping keys on top of the aggregate on 7 columns,
>but with different sort directions:
>select * from foo group by a,b,c... order by c desc, a asc, b desc...
>What sort order and direction should the sort-based stream aggregate provide?
>All ascending, all descending, order (a,b,c...), order (...c,b,a), or all the combinations?
>All of those enumerated traits are useless except one; for the others, an additional
>sort operator will be needed.
>
>Another example is an aggregate on top of a join, where the join is on 7 keys and the
>aggregate is on 2 of the join keys. In a distributed system, what distribution trait would
>the join operator provide? The 2 grouping keys? All the join keys? All the combinations?
>
>Enumerating some or all of the deliverable traits is not purpose-driven. All the traits
>may be just useless for the parent operator. On the other hand, asking the child
>operator for particular traits is purpose-driven: at least the traits asked for by the parent
>operator are worth considering, so it is not as wasteful as the former.
>
>If I understand RelCompositeTrait's intent correctly, the enumerated traits, whether
>some combinations or all combinations, should be saved there. But in fact,
>it seems they are not. And as Jacques mentioned, many people rely on RelMetadata
>operations to pull up the traitsets through operators.
>
>This makes me wonder whether there are any true use cases or systems
>that rely on RelCompositeTrait. If someone has a story, we would love to hear it.
>
>Putting that aside, even if RelCompositeTrait is indispensable, why do we bother optimizing
>the Values node? For Values with several tuples, it is not worth optimizing; with
>many tuples, it may take more time to enumerate the RelCollations than to just sort them.
>Specifically for Values with 0 or 1 tuple but many columns, it is definitely not
>worth the optimization, because the sort removal rule and the empty rel removal rule should
>do the work.
>
>
>Thanks ~
>Haisheng Yuan

Re: Re: [DISCUSS] RelCompositeTrait

Posted by Haisheng Yuan <h....@alibaba-inc.com>.
> There are major challenges with asking for particular traits as well.
Imagine a desired aggregate on 7 columns. What does the requestor request
with regards to distribution? All seven columns? One column? Some
combination in between?

The same challenges exist for enumerating all the traits as well. Imagine
there is an order by on the 7 grouping keys on top of the aggregate on 7 columns,
but with different sort directions:
select * from foo group by a,b,c... order by c desc, a asc, b desc...
What sort order and direction should the sort-based stream aggregate provide?
All ascending, all descending, order (a,b,c...), order (...c,b,a), or all the combinations?
All of those enumerated traits are useless except one; for the others, an additional
sort operator will be needed.
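
To put a rough number on "all the combinations", the arithmetic can be
sketched as follows. This is only an illustration with a hypothetical
class name; it assumes we count full-length sort orders over the n
grouping keys (every permutation of the keys, each key either
ascending or descending) and ignores shorter prefixes.

```java
/** Hypothetical arithmetic sketch, independent of Calcite. */
final class CollationCountSketch {
  /**
   * Number of full-length sort orders over n keys: n! permutations of the
   * keys times 2^n ASC/DESC choices. Intended for small n only; the result
   * overflows a long for large n.
   */
  static long fullLengthCollations(int n) {
    long count = 1;
    for (int i = 2; i <= n; i++) {
      count *= i;        // builds up n!
    }
    return count << n;   // multiplies by 2^n direction choices
  }

  public static void main(String[] args) {
    // 7 grouping keys: 5040 permutations * 128 direction choices
    System.out.println(fullLengthCollations(7));
  }
}
```

For 7 keys this gives 5040 * 128 = 645120 candidate collations, of
which the parent's order by can use exactly one.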

Another example is an aggregate on top of a join, where the join is on 7 keys and the
aggregate is on 2 of the join keys. In a distributed system, what distribution trait would
the join operator provide? The 2 grouping keys? All the join keys? All the combinations?

Enumerating some or all of the deliverable traits is not purpose-driven. All the traits
may be just useless for the parent operator. On the other hand, asking the child
operator for particular traits is purpose-driven: at least the traits asked for by the parent
operator are worth considering, so it is not as wasteful as the former.

If I understand RelCompositeTrait's intent correctly, the enumerated traits, whether
some combinations or all combinations, should be saved there. But in fact,
it seems they are not. And as Jacques mentioned, many people rely on RelMetadata
operations to pull up the traitsets through operators.

This makes me wonder whether there are any true use cases or systems
that rely on RelCompositeTrait. If someone has a story, we would love to hear it.

Putting that aside, even if RelCompositeTrait is indispensable, why do we bother optimizing
the Values node? For Values with several tuples, it is not worth optimizing; with
many tuples, it may take more time to enumerate the RelCollations than to just sort them.
Specifically for Values with 0 or 1 tuple but many columns, it is definitely not
worth the optimization, because the sort removal rule and the empty rel removal rule should
do the work.


Thanks ~
Haisheng Yuan
------------------------------------------------------------------
From: Jacques Nadeau <ja...@apache.org>
Date: 2019/04/15 07:36:51
To: <de...@calcite.apache.org>
Subject: Re: [DISCUSS] RelCompositeTrait

There are major challenges with asking for particular traits as well.
Imagine a desired aggregate on 7 columns. What does the requestor request
with regards to distribution? All seven columns? One column? Some
combination in between? The trait system in Calcite is very challenging to
work with because it is up to downstream users to try to figure out trait
propagation outside the core. It is so challenging that I believe many
people move to relying on RelMetadata operations, since those can be pulled
across several operators at once.

It would be great if someone could spend the time to come up with a more
global design for these items so that we avoid solving one-off problems:
rationalizing when something should be a trait, how to avoid trait planning
cost explosion, how to propagate, when something should be handled via
RelMetadataQuery, when something should be managed via traits versus
materialized view alternatives, etc.

An example of overlapping functionality I'd start with is: should
multitraits for collation really exist or would exposing these as
materialized view alternatives be more appropriate? Why is it necessary to
have a 'shortcut' for this situation while other alternatives don't have
one?





Re: [DISCUSS] RelCompositeTrait

Posted by Jacques Nadeau <ja...@apache.org>.
There are major challenges with asking for particular traits as well.
Imagine a desired aggregate on 7 columns. What does the requestor request
with regards to distribution? All seven columns? One column? Some
combination in between? The trait system in Calcite is very challenging to
work with because it is up to downstream users to try to figure out trait
propagation outside the core. It is so challenging that I believe many
people move to relying on RelMetadata operations, since those can be pulled
across several operators at once.
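
The "All seven columns? One column? Some combination in between?"
question can be made concrete with a small sketch. It is not based on
Calcite's API; the names are hypothetical. It assumes hash
distribution, where distributing on any non-empty subset of the
grouping keys keeps all rows of a group on the same node (rows that
agree on all grouping keys also agree on the subset).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; these are not Calcite's classes. */
final class DistributionChoiceSketch {
  /**
   * Returns whether hash-distributing on distKeys keeps each group together:
   * true when distKeys is a non-empty subset of groupKeys.
   */
  static boolean colocatesGroups(Set<String> distKeys, Set<String> groupKeys) {
    return !distKeys.isEmpty() && groupKeys.containsAll(distKeys);
  }

  /** Enumerates every non-empty subset of the keys, one bit mask per subset. */
  static List<Set<String>> candidateDistributions(List<String> keys) {
    List<Set<String>> result = new ArrayList<>();
    for (int mask = 1; mask < (1 << keys.size()); mask++) {
      Set<String> subset = new TreeSet<>();
      for (int i = 0; i < keys.size(); i++) {
        if ((mask & (1 << i)) != 0) {
          subset.add(keys.get(i));
        }
      }
      result.add(subset);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> groupKeys = List.of("a", "b", "c", "d", "e", "f", "g");
    // 2^7 - 1 non-empty subsets, each a distribution the requestor could ask for
    System.out.println(candidateDistributions(groupKeys).size());
  }
}
```

Under these assumptions an aggregate on 7 columns has 127 acceptable
hash distributions, so a requestor that must name exactly one of them
is guessing, which is the difficulty described above.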

It would be great if someone could spend the time to come up with a more
global design for these items so that we avoid solving one-off problems:
rationalizing when something should be a trait, how to avoid trait planning
cost explosion, how to propagate, when something should be handled via
RelMetadataQuery, when something should be managed via traits versus
materialized view alternatives, etc.

An example of overlapping functionality I'd start with is: should
multitraits for collation really exist or would exposing these as
materialized view alternatives be more appropriate? Why is it necessary to
have a 'shortcut' for this situation while other alternatives don't have
one?



On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <jh...@apache.org> wrote:

> It seemed reasonable when I introduced it, and still seems very reasonable, that
> a relational expression (even in the relational model) can have multiple
> physical properties. Consider these questions that the planner might ask:
>
> Example 1:
>
> “Are you sorted on hiredate?”
> “Yes”
> “Are you sorted on empno?”
> “Yes”
> “Are you sorted on deptno?”
> “No”
>
> Example 2:
>
> “Can you fit into less than 100MB of memory?”
> “Yes”
> “Can you fit into less than 10MB of memory?”
> “Yes”
> “Can you fit into less than 1MB of memory?”
> “No”
>
> We manage traits like those in example 1 using RelCompositeTrait. We can’t
> handle traits like this in example 2, and so we have trained ourselves to
> not think of “can fit into memory X” as a trait at all.
>
> Perhaps our mistake is to have an API “tell me all of your traits” rather
> than an API “do you have trait X?”. Asking a RelNode to enumerate its
> traits can be painful: the extreme case is an empty Values with 100
> columns; it satisfies any sort order, and there are 100! of these.
>
> Julian

Re: [DISCUSS] RelCompositeTrait

Posted by Julian Hyde <jh...@apache.org>.
It seemed reasonable when I introduced it, and still seems very reasonable, that a relational expression (even in the relational model) can have multiple physical properties. Consider these questions that the planner might ask:

Example 1:

“Are you sorted on hiredate?”
“Yes”
“Are you sorted on empno?”
“Yes”
“Are you sorted on deptno?”
“No”

Example 2:

“Can you fit into less than 100MB of memory?”
“Yes”
“Can you fit into less than 10MB of memory?”
“Yes”
“Can you fit into less than 1MB of memory?”
“No”

We manage traits like those in example 1 using RelCompositeTrait. We can’t handle traits like this in example 2, and so we have trained ourselves to not think of “can fit into memory X” as a trait at all.

Perhaps our mistake is to have an API “tell me all of your traits” rather than an API “do you have trait X?”. Asking a RelNode to enumerate its traits can be painful: the extreme case is an empty Values with 100 columns; it satisfies any sort order, and there are 100! of these.
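
The difference between the two API styles can be sketched roughly as
below. This is an illustration only: the class and method names are
hypothetical and are not part of Calcite. It models a multi-sorted
relation as a list of known sort orders (as in Example 1), answers "do
you have trait X?" by a prefix check against each of them, and
separately computes how large 100! is.

```java
import java.math.BigInteger;
import java.util.List;

/** Hypothetical sketch; none of these names come from Calcite. */
final class TraitQuerySketch {
  /**
   * "Do you have trait X?": true if any known sort order starts with the
   * requested columns. Sort direction is omitted for brevity.
   */
  static boolean isSortedOn(List<List<String>> knownOrders, List<String> requested) {
    return knownOrders.stream().anyMatch(order ->
        requested.size() <= order.size()
            && order.subList(0, requested.size()).equals(requested));
  }

  /** n!, the number of orderings of n columns an enumerating API would have to list. */
  static BigInteger factorial(int n) {
    BigInteger result = BigInteger.ONE;
    for (int i = 2; i <= n; i++) {
      result = result.multiply(BigInteger.valueOf(i));
    }
    return result;
  }

  public static void main(String[] args) {
    // Example 1: a table known to be sorted on hiredate and, separately, on empno.
    List<List<String>> emp = List.of(List.of("hiredate"), List.of("empno"));
    for (String column : List.of("hiredate", "empno", "deptno")) {
      System.out.println("sorted on " + column + "? " + isSortedOn(emp, List.of(column)));
    }
    System.out.println("digits in 100!: " + factorial(100).toString().length());
  }
}
```

With a question-style API, an empty Values could simply answer "yes"
to every request instead of listing its orderings; 100! is a 158-digit
number, so listing them is not an option.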

Julian



> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <za...@gmail.com> wrote:
> 
> Hi Haisheng,
> 
> Thanks for raising awareness around this topic. I also think we should try
> to find a solution.
> 
> Initially, the Volcano planner was designed to be able to cover multiple
> models (and not only the relational). For non-relational models composite
> traits may be indispensable. I don't know if there are people on this list
> that are using the planner for other models but if there are it would be
> nice to hear from them.
> 
> Focusing exclusively on the relational model, I think composite traits are
> useful. One use-case that comes to my mind is data replication. It
> perfectly makes sense to partition (distribute) your table on two (or more)
> columns to be able to efficiently execute queries using special partition
> joins. A concrete use-case is RDF data where many distributed systems store
> the triples table partitioned by subject and object. I guess such use-cases
> could possibly be modelled in other ways but composite traits are what come
> naturally to my mind.
> 
> Regarding multi-sorted tables, they are not that rare: you get one if you import
> sorted data into a table with an auto-increment primary key, for example.
> 
> I think all the trait-related issues can be solved if we prioritize them
> correctly. Apart from Vladimir and Hongze, who have already spent quite some
> time on these, the rest of us should also jump in and try to help.
> 
> Best,
> Stamatis
> 
> 
> 
> 
> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h....@alibaba-inc.com> wrote:
> 
>> Hi,
>> 
>> I found there are some RelCompositeTrait related issues:
>> https://issues.apache.org/jira/browse/CALCITE-2010
>> https://issues.apache.org/jira/browse/CALCITE-2593
>> https://issues.apache.org/jira/browse/CALCITE-2764
>> 
>> Multi-sorted tables are rare in practice, and multi-distributed tables don't
>> exist either. A Values node with a few tuples is not worth optimizing, and
>> one with many tuples is not worth optimizing either, because the time the
>> optimizer takes to figure out the ordering may be longer than just sorting
>> it at runtime.
>> 
>> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
>> Leo extended RelDistribution to inherit RelMultipleTrait, just like
>> RelCollation does, to solve his problem in the example. But I don't think
>> this is an appropriate way to represent the equivalence classes (in
>> PostgreSQL's term).
>> 
>> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
>> beginning? Seems like it gives us more pain than gain.
>> 
>> Thanks ~
>> Haisheng Yuan
>> 


Re: [DISCUSS] RelCompositeTrait

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Haisheng,

Thanks for raising awareness around this topic. I also think we should try
to find a solution.

Initially, the Volcano planner was designed to be able to cover multiple
models (and not only the relational). For non-relational models composite
traits may be indispensable. I don't know if there are people on this list
that are using the planner for other models but if there are it would be
nice to hear from them.

Focusing exclusively on the relational model, I think composite traits are
useful. One use-case that comes to my mind is data replication. It
perfectly makes sense to partition (distribute) your table on two (or more)
columns to be able to efficiently execute queries using special partition
joins. A concrete use-case is RDF data where many distributed systems store
the triples table partitioned by subject and object. I guess such use-cases
could possibly be modelled in other ways but composite traits are what come
naturally to my mind.

Regarding multi-sorted tables, they are not that rare: you get one if you import
sorted data into a table with an auto-increment primary key, for example.

I think all the trait-related issues can be solved if we prioritize them
correctly. Apart from Vladimir and Hongze, who have already spent quite some
time on these, the rest of us should also jump in and try to help.

Best,
Stamatis




On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h....@alibaba-inc.com> wrote:

> Hi,
>
> I found there are some RelCompositeTrait related issues:
> https://issues.apache.org/jira/browse/CALCITE-2010
> https://issues.apache.org/jira/browse/CALCITE-2593
> https://issues.apache.org/jira/browse/CALCITE-2764
>
> Multi-sorted tables are rare in practice, and multi-distributed tables don't
> exist either. A Values node with a few tuples is not worth optimizing, and
> one with many tuples is not worth optimizing either, because the time the
> optimizer takes to figure out the ordering may be longer than just sorting
> it at runtime.
>
> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
> Leo extended RelDistribution to inherit RelMultipleTrait, just like
> RelCollation does, to solve his problem in the example. But I don't think
> this is an appropriate way to represent the equivalence classes (in
> PostgreSQL's term).
>
> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
> beginning? Seems like it gives us more pain than gain.
>
> Thanks ~
> Haisheng Yuan
>