Posted to dev@calcite.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2020/01/04 22:28:01 UTC

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Hi Vladimir,

I think we should leave it as it was.

The fact that VolcanoCost does not exploit the cpu and io information is a
problem of that class and not of the Sort#computeSelfCost method. Note that
the VolcanoPlanner can be configured with a RelOptCostFactory, which means
that people who use the planner may use another implementation of
RelOptCost that does take cpu and io metrics into account.

Regarding the more general question, I think we should try to make our cost
estimations more realistic in terms of cpu and io, and not try to put
everything in rows, as is the case for various operators.
This of course requires the VolcanoCost to be adapted.
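As a standalone sketch of what such an adaptation could look like (this class and its weights are illustrative assumptions, not Calcite code; it only mirrors the general shape of a cost value): compare on a weighted combination of rows, cpu, and io rather than on rows alone, so that work moved into the cpu field still counts.

```java
// Hypothetical cost value (NOT Calcite's VolcanoCost): keeps rows, cpu and
// io separate, but compares costs on a weighted combination of all three.
public class WeightedCost {
    // Assumed weights, chosen purely for illustration.
    static final double CPU_WEIGHT = 1.0;
    static final double IO_WEIGHT = 100.0;

    final double rows, cpu, io;

    WeightedCost(double rows, double cpu, double io) {
        this.rows = rows;
        this.cpu = cpu;
        this.io = io;
    }

    // Collapse the three fields into one comparable number.
    double weighted() {
        return rows + CPU_WEIGHT * cpu + IO_WEIGHT * io;
    }

    // Compare on the combined metric, not on rows only.
    boolean isLe(WeightedCost other) {
        return weighted() <= other.weighted();
    }

    public static void main(String[] args) {
        // A sort that moved its N*log(N) work into the cpu field is no
        // longer "free" compared to a pass-through with the same row count:
        WeightedCost passThrough = new WeightedCost(1000, 0, 0);
        WeightedCost sort = new WeightedCost(1000, 10000, 0);
        System.out.println(sort.isLe(passThrough)); // false: cpu now counts
    }
}
```

Under this scheme, a comparator that looks only at `rows` would consider the two costs equal, which is exactly the problem described for CALCITE-1842.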

Best,
Stamatis



On Sun, Dec 29, 2019 at 7:33 PM Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> Hi,
>
> I'm inclined to revert
>
> https://github.com/apache/calcite/commit/48a20668647b5a5e86073ef0e9ce206669ad6867
> Motivation can be found in
>
> https://issues.apache.org/jira/browse/CALCITE-1842?focusedCommentId=17004696&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17004696
>
> WDYT?
>
> The question there is Sort#computeSelfCost.
> We have (rows, cpu, io) cost fields; however, most of the time we use just
> **rows** to represent the cost.
> For instance, EnumerableHashJoin computes the cost and returns (rows, 0,
> 0).
>
> CALCITE-1842 adjusted the Sort costing by moving the N*log(N) term to the
> cpu field, which makes sorting virtually free
> because the current Volcano uses only the rows field when comparing the
> costs.
>
> Unfortunately, CALCITE-1842 has no tests, so I don't really see what the
> problem was.
>
> Vladimir
>

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Posted by Vladimir Sitnikov <si...@gmail.com>.
Stamatis>HepPlanner does not perform cost-based decisions so for most
use-cases I

It seems to make some cost-based decisions, so providing an ill-defined
cost seems wrong.
Well, let's see how the new VolcanoCost works out, and then we can discuss
Hep.

Stamatis>Although it is nice to account for CPU and IO in the cost model,
it has
Stamatis>been shown [1]

I don't think the article applies here.
Calcite's Volcano optimizer tends to produce lots of Project and Filter
relations,
and it is important to weigh them appropriately; otherwise, it would
endlessly generate
dumb Projects.

Stamatis>[1] has also an interesting discussion regarding Postgres
Stamatis>cost model.

The PostgreSQL cost model has _two_ cost values for each relation: startup
cost and total cost (see https://www.postgresql.org/docs/12/sql-explain.html ).
However, the article does not mention the difference between startup and
total cost, so it does not look like
it properly discusses the PostgreSQL cost model.

I'm not sure how often startup cost is useful, but it might help for cases
like "(...) limit 5".
For example, currently we don't really use "offset/limit" for cost
computation,
and we do not have a clear answer for that.

On the other hand, if each relation could supply its "startup cost", then
the cost of "limit" would be
startupCost + totalCost*(limit / totalRows)
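A hedged illustration of that interpolation (the method, its clamping, and the numbers are assumptions for the example, not a Calcite API):

```java
// Sketch of the startup-cost idea from the thread: if every relation
// reported a startup cost next to its total cost, the cost of "LIMIT n"
// could be interpolated as startupCost + totalCost * (limit / totalRows).
public class LimitCost {
    static double limitCost(double startupCost, double totalCost,
                            double limit, double totalRows) {
        // Clamp the fraction so a limit larger than the input never
        // exceeds the full cost of the input relation.
        double fraction = Math.min(1.0, limit / totalRows);
        return startupCost + totalCost * fraction;
    }

    public static void main(String[] args) {
        // Cheap-startup scan: LIMIT 5 of 1000 rows costs only a fraction.
        System.out.println(limitCost(10, 1000, 5, 1000));  // 15.0
        // A sort must consume all input before emitting its first row,
        // so its startup cost dominates and LIMIT barely helps:
        System.out.println(limitCost(900, 1000, 5, 1000)); // 905.0
    }
}
```

The two calls illustrate why startup cost matters: the same LIMIT is nearly free over a streaming input but expensive over a blocking one, which a single total-cost number cannot express.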

Vladimir

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Posted by Stamatis Zampetakis <za...@gmail.com>.
Thanks for the PR!

HepPlanner does not perform cost-based decisions, so for most use cases I
don't think it matters a lot what the cost is. For more advanced use cases,
where there might be rules that use CumulativeCost and other kinds of such
metadata, the HepPlanner can be configured with another cost factory.

I don't see a big reason to change the API of RelOptCost#getRows; the
callers who ask for the cumulative cost should know what to expect. Apart
from that, rejected rows and other metrics can be added if necessary.

Although it is nice to account for CPU and IO in the cost model, it has
been shown [1] that for main memory settings even a simple cost model based
entirely on cardinality estimations can work fine for many queries. For
those interested [1] has also an interesting discussion regarding Postgres
cost model.

Best,
Stamatis

[1] http://www.vldb.org/pvldb/vol9/p204-leis.pdf

On Sun, Jan 5, 2020 at 2:12 PM Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> In the meantime, I've created a PR that updates VolcanoCost:
> https://github.com/apache/calcite/pull/1722
>
> Vladimir
>

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Posted by Vladimir Sitnikov <si...@gmail.com>.
In the meantime, I've created a PR that updates VolcanoCost:
https://github.com/apache/calcite/pull/1722

Vladimir

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Posted by Vladimir Sitnikov <si...@gmail.com>.
>This of course requires the VolcanoCost to be adapted.

What do you think of HepPlanner?
It uses RelOptCostImpl.FACTORY by default, which explicitly ignores the CPU
and IO cost factors :((

Regarding cost#rows, there's a problem: the cost#rows field does not add up
well when computing the cumulative cost.
What if we put the number of _rejected_ rows into the cost#rows field?

Then the field would have a well-defined meaning:
* If the value is high, the plan is probably rejecting a lot of unrelated
rows, thus it is suboptimal.
* Extra Project/Calc nodes won't artificially increase the rows in the cost
fields. Currently, each Project adds its "rows", which is not very good.
* It is clear what to put into the rows field: "rejected rows" is
more-or-less understandable. For Project it would be 0.
* Join/Filter/Calc nodes would show "estimated number of returned rows=X
(from the metadata query), rejected rows=Y (from the cost)", which would
help in understanding where the time is spent.
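A minimal sketch of the bookkeeping behind this proposal (the helper is hypothetical, not Calcite code): the rows field holds input rows minus output rows, so a pure Project contributes 0 and the per-node values sum meaningfully into a cumulative cost.

```java
// Hypothetical illustration of the "rejected rows" idea: a node's rows
// cost is the number of input rows it discards, not the rows it returns.
public class RejectedRowsCost {
    static double rejected(double inputRows, double outputRows) {
        return inputRows - outputRows;
    }

    public static void main(String[] args) {
        // Plan: Scan(1000 rows) -> Filter(keeps 100) -> Project(keeps all 100)
        double filterCost = rejected(1000, 100); // 900: rows removed by filter
        double projectCost = rejected(100, 100); // 0: Project rejects nothing
        double cumulative = filterCost + projectCost;
        System.out.println(cumulative); // 900.0: only the Filter contributes
    }
}
```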

That is inspired by PostgreSQL's "rows removed by filter" output when
running EXPLAIN ANALYZE (which executes the statement and collects
statistics on each execution plan node):
http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.2#Explain_improvements

Vladimir

Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order

Posted by Vladimir Sitnikov <si...@gmail.com>.
>I think we should try to make our cost
estimations more realistic in terms of cpu and io and don't try to put
everything in rows as it is the case for various operators.
This of course requires the VolcanoCost to be adapted.

Well, the more I revise costs, the more I am inclined toward that opinion
as well.

It looks like the comparison should be based on cpu + io*CPU_PER_IO. Then
'rows' becomes obsolete :-/

It seems 'rows' duplicates the rowCount metadata, so I am inclined to hide
'rows' from the toString output. Probably, toString should show the
cpu + io*CPU_PER_IO result.
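A standalone sketch of that single-metric comparison (CPU_PER_IO is an assumed constant, and none of this is Calcite's actual VolcanoCost): both isLe and toString are driven by cpu + io*CPU_PER_IO, with rows left to the rowCount metadata.

```java
// Hypothetical cost that collapses cpu and io into one number, leaving
// rows out entirely, as suggested in the message above.
public class CombinedCost {
    static final double CPU_PER_IO = 100.0; // assumed weight, for illustration

    final double cpu, io;

    CombinedCost(double cpu, double io) {
        this.cpu = cpu;
        this.io = io;
    }

    double value() {
        return cpu + io * CPU_PER_IO;
    }

    boolean isLe(CombinedCost other) {
        return value() <= other.value();
    }

    // toString shows the combined metric; rows is not part of the cost.
    @Override public String toString() {
        return "{cost=" + value() + "}";
    }

    public static void main(String[] args) {
        CombinedCost cpuHeavy = new CombinedCost(5000, 0);
        CombinedCost ioHeavy = new CombinedCost(0, 60); // 60 * 100 = 6000
        System.out.println(cpuHeavy.isLe(ioHeavy)); // true: 5000 <= 6000
        System.out.println(ioHeavy); // {cost=6000.0}
    }
}
```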

Vladimir