You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2020/01/08 15:21:48 UTC

[DISCUSS] propagateCostImprovements vs incremental bestCost maintenance vs metadata

Hi,

As far as I understand, the incremental best/bestCost maintenance at
RelSubset level does not really work.

That issue is triggered a lot in MaterializationTests due to
https://issues.apache.org/jira/browse/CALCITE-3682
(MaterializationService#defineMaterialization
loses information on unique keys)
In other words, materialization does not have uniqueness information, so
when the planner realizes that materialization is connected to the source
table,
it suddenly receives extra metadata which alters cost estimates
dramatically.

Here's the setup:
1) RelSubset assumes that best and bestCost are always maintained
incrementally.
2) If a relation changes (e.g. it is added to a subset), the cost change is
propagated to parentRels (~all the rels that might have that rel as input).

The propagation happens only in case the new rel is a new best (see [1]).
So far it looks ok: if we have the new best, then we propagate to other
parents.
If the new rel is worse than the previous best, why bother with propagation?

== Now comes the issue ==
The newly added rel might easily affect the costs of other rels even if the
rel is not the best in its subset.

Here's how that is possible:
RelMdColumnUniqueness#areColumnsUnique(RelSubset, ...) iterates over all
the rels in the subset,
so even if the newly added rel is not the best, it might happen to
answer areColumnsUnique request
so other cost functions that rely on uniqueness (e.g. cardinality
estimations) would change.

In other words: if the planner somehow realizes a certain subset returns
unique rows, then a join (in a very distant subset) that was supposed to be
M*N
becomes M+N, and its cost greatly reduces even though the subset's best is
not changed.

At this point, I'm inclined that incremental bestCost maintenance is not
really possible.

Any thoughts?

[1]:
https://github.com/apache/calcite/blob/571731b80a58eb095ebac7123285c375e7afff90/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L358-L360

Vladimir

Re: [DISCUSS] propagateCostImprovements vs incremental bestCost maintenance vs metadata

Posted by Xiening Dai <xn...@gmail.com>.

That’s why I say it’s hard to solve under current framework design. The example query you provide can be, and should be, optimized during logical transformation phase. At that moment, there shouldn’t be any cost calculation since all we are doing is to explore equivalences. Once the transformation is done, the row count, uniqueness which could affect cost calculation shouldn’t change anymore. 

But with current Calcite design, rules don’t have concepts of different stages, which means a RelNode can be implemented before it’s been fully explored. That’s why we will have problem like this.

> On Jan 8, 2020, at 1:42 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
>> In theory, the cardinality and uniqueness of a RelSubset should never
> changed per definition of equivalent set
> 
> I agree. It is like in theory there is no difference between theory and
> practice :)
> 
> What if we have select empid from (select empid from emps where empid>0)
> where empid<0  ?
> The original logical plans would likely have two filters, and metadata
> would estimate rowcount as 15% * 15% * tableRowCount (or something like
> that).
> 
> Later the optimizer might realize the relation is equivalent to an empty
> relation, and it could refine the row count as 0.
> So cardinality estimates of a set can vary over time, and I don't think we
> can prevent that.
> 
> It would be nice if metadata could identify a fixed point somehow like in
> dataflow algorithms.
> 
>> We should probably fix defineMaterialization() to provide uniqueness info
> in the first place
> 
> I'm not sure that is the only trigger.
> The thing is VolcanoPlanner.isValid is not activated by default, so we do
> not see cost-related assertion errors.
> 
> Vladimir

Re: [DISCUSS] propagateCostImprovements vs incremental bestCost maintenance vs metadata

Posted by Vladimir Sitnikov <si...@gmail.com>.

>In theory, the cardinality and uniqueness of a RelSubset should never
changed per definition of equivalent set

I agree. It is like in theory there is no difference between theory and
practice :)

What if we have select empid from (select empid from emps where empid>0)
where empid<0  ?
The original logical plans would likely have two filters, and metadata
would estimate rowcount as 15% * 15% * tableRowCount (or something like
that).

Later the optimizer might realize the relation is equivalent to an empty
relation, and it could refine the row count as 0.
So cardinality estimates of a set can vary over time, and I don't think we
can prevent that.

It would be nice if metadata could identify a fixed point somehow like in
dataflow algorithms.

>We should probably fix defineMaterialization() to provide uniqueness info
in the first place

I'm not sure that is the only trigger.
The thing is VolcanoPlanner.isValid is not activated by default, so we do
not see cost-related assertion errors.

Vladimir

Re: [DISCUSS] propagateCostImprovements vs incremental bestCost maintenance vs metadata

Posted by Xiening Dai <xn...@gmail.com>.

This is similar to CALCITE-2166 where a RelNode’s best cost could increase after its input RelSubset cardinality is changed. Unfortunately there’s no easy way to fix this with current framework design. In theory, the cardinality and uniqueness of a RelSubset should never changed per definition of equivalent set. The cost propagation logic is built upon such assumption, which I think is fine.

Regarding this particular issue, I think the JIRA is pointing to the right direction. We should probably fix defineMaterialization() to provide uniqueness info in the first place. This would be a much easier fix than updating cost propogation.

> On Jan 8, 2020, at 7:21 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Hi,
> 
> As far as I understand, the incremental best/bestCost maintenance at
> RelSubset level does not really work.
> 
> That issue is triggered a lot in MaterializationTests due to
> https://issues.apache.org/jira/browse/CALCITE-3682
> (MaterializationService#defineMaterialization
> loses information on unique keys)
> In other words, materialization does not have uniqueness information, so
> when the planner realizes that materialization is connected to the source
> table,
> it suddenly receives extra metadata which alters cost estimates
> dramatically.
> 
> Here's the setup:
> 1) RelSubset assumes that best and bestCost are always maintained
> incrementally.
> 2) If a relation changes (e.g. it is added to a subset), the cost change is
> propagated to parentRels (~all the rels that might have that rel as input).
> 
> The propagation happens only in case the new rel is a new best (see [1]).
> So far it looks ok: if we have the new best, then we propagate to other
> parents.
> If the new rel is worse than the previous best, why bother with propagation?
> 
> == Now comes the issue ==
> The newly added rel might easily affect the costs of other rels even if the
> rel is not the best in its subset.
> 
> Here's how that is possible:
> RelMdColumnUniqueness#areColumnsUnique(RelSubset, ...) iterates over all
> the rels in the subset,
> so even if the newly added rel is not the best, it might happen to
> answer areColumnsUnique request
> so other cost functions that rely on uniqueness (e.g. cardinality
> estimations) would change.
> 
> In other words: if the planner somehow realizes a certain subset returns
> unique rows, then a join (in a very distant subset) that was supposed to be
> M*N
> becomes M+N, and its cost greatly reduces even though the subset's best is
> not changed.
> 
> At this point, I'm inclined that incremental bestCost maintenance is not
> really possible.
> 
> Any thoughts?
> 
> [1]:
> https://github.com/apache/calcite/blob/571731b80a58eb095ebac7123285c375e7afff90/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L358-L360
> 
> Vladimir