Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2020/01/04 15:55:54 UTC

[DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Hi,

I've spent some time on stabilizing the costs (see
https://github.com/apache/calcite/pull/1702/commits ), and it looks like we
might want to have some notion of "cost unit".

For instance, we want to express that sorting table with 2 int columns is
cheaper than sorting table with 22 int columns.
Unfortunately, VolcanoCost is compared by the rows field only, so, for now, we
fold the number of fields into the cost#rows field by adding something
like "cost += fieldCount * 0.1" :(
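
For illustration, a rough sketch of what that workaround looks like in a
computeSelfCost override (hypothetical code, not what the PR actually does;
the 0.1 multiplier is exactly the kind of guessed constant I mean):

  // Hypothetical Sort-like RelNode: fold the field count into the rows
  // dimension, because the default VolcanoCost comparison uses rows only.
  @Override public RelOptCost computeSelfCost(RelOptPlanner planner,
      RelMetadataQuery mq) {
    double rowCount = mq.getRowCount(getInput());
    int fieldCount = getRowType().getFieldCount();
    // N*log(N) comparisons for the sort itself.
    double cpu = rowCount * Math.max(Math.log(rowCount), 1);
    // Guessed per-field penalty smuggled into rows, since cpu/io are
    // effectively ignored when costs are compared.
    double rows = rowCount + fieldCount * 0.1;
    return planner.getCostFactory().makeCost(rows, cpu, 0);
  }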

Of course, you might say that cost is pluggable, however, I would like to
make the default implementation sane.
At least it should be good enough for inspirational purposes.

What do you think about adding a way to convert Cost to a double?
For instance, we could add measurer.measure(cost) that would weigh the cost,
or we could add a method like `double cost#toSeconds()`.
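
To make the shape of that API concrete, something like this (the names are
invented for illustration only):

  // Invented names, just to illustrate the proposal.
  public interface RelOptCostMeasurer {
    /** Collapses a multi-dimensional cost into a single comparable
     * number, e.g. an estimated duration in seconds. */
    double measure(RelOptCost cost);
  }

  // Alternatively, a method on the cost itself:
  //   double toSeconds();
  // so the planner could compare the measured values instead of rows only.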

I guess, if we add a unit (e.g. microsecond), then we could even
micro-benchmark different join implementations, and use the appropriate
cost values
for extra columns and so on.

I fully understand that the cost does not have to be precise; however, it
is sad to guesstimate the multipliers for an extra field in a projection.



Just to recap:
1) I've started with making tests parallel <-- this was the main goal
2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition
which was using static variables
3) As I fixed EnumerableMergeJoinRule, it turned out the optimizer started
to use merge join all over the place
4) It was caused by inappropriate costing of Sort, which I fixed
5) Then I updated the cost functions of EnumerableHashJoin and
EnumerableNestedLoopJoin, and it was not enough because ReflectiveSchema
was not providing the proper statistics
6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
rowcount metadata.

All of the above seems to be more-or-less stable (the plans improved!), and
the failing tests, for now, are in MaterializationTest.

The problem with those tests is that the cost differences between NestedLoop
and HashJoin are tiny.
For instance:
testJoinMaterialization8

EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost =
{105.19999999999999 rows, 82.6 cpu, 0.0 io}, id = 780
  EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount =
6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
    EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount =
22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
      EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative
cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
    EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]):
rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
      EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0,
cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125

vs

EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost =
{81.19999999999999 rows, 55.6 cpu, 0.0 io}, id = 778
  EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR,
CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost =
{74.6 rows, 49.0 cpu, 0.0 io}, id = 777
    EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0,
cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
    EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost
= {22.0 rows, 23.0 cpu, 0.0 io}, id = 152

The second plan looks cheaper to the optimizer; however, the key difference
comes from the three projects in the first plan (the projects account for
6.6+22+2=30.6 cost).
If I increase the hr.dependents table to 3 rows, then the hash-based plan
becomes cheaper.
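
(Spelled out: the first plan carries three Projects with self-cost rows of
22.0 + 2.0 + 6.6 = 30.6, while the second plan carries a single 6.6-row
Project, so the cumulative rows differ by 105.2 - 81.2 = 24.0, which is
exactly the two extra Projects.)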

To me, both plans look acceptable; however, it is sad to analyze/debug
those differences without being able to tell whether that is a plan
degradation or whether it is acceptable.

Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Vladimir Sitnikov <si...@gmail.com>.
Michael>If we want to calibrate

A part of the question is "What should Aggregate#computeSelfCost return?"

A) There's an option to make that method abstract so every sub-class
defines its own cost implementation.
It might be sad, though, since it would likely mean the same N*log(N)
formula being duplicated all over the place.

B) On the other hand, if there's a generic implementation in the Aggregate
class, then it should be aligned
with the costs of Enumerable* / Bindable* / Cassandra* and so on.

For instance, if EnumerableHashJoin uses "0.5 units per hashed cell", then
Aggregate should use the same units,
so the optimizer can properly tell the cost difference between
EnumerableHashSemiJoin(A, B) and EnumerableHashJoin(A, Aggregate(B)).
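
As a sketch of what option B implies in code (the constant and the helper
are illustrative only, not actual Calcite code):

  // Illustrative: a shared constant so hash-based operators agree on units.
  public final class CostUnits {
    private CostUnits() {}
    /** Assumed cost of hashing a single cell, in abstract cost units. */
    public static final double HASH_CELL_COST = 0.5;
  }

  // EnumerableHashJoin#computeSelfCost would then charge roughly
  //   buildRowCount * buildFieldCount * CostUnits.HASH_CELL_COST
  // and a hash-based Aggregate would charge roughly
  //   inputRowCount * groupKeyCount * CostUnits.HASH_CELL_COST,
  // which keeps EnumerableHashSemiJoin(A, B) comparable with
  // EnumerableHashJoin(A, Aggregate(B)).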

Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Michael Mior <mm...@apache.org>.
Having some kind of calibration that could be run would be nice :) I
suppose single-block read times on HDDs are likely stable, but we
don't really know where the data is coming from. It could be an HDD, an
SSD, or even a network service with variable latency. So I'm not
convinced we'll ever get estimations that are really accurate. That's
why I'm suggesting we just call things cost units and leave it at
that. If we want to calibrate, what really matters is the ratio
between CPU time and data read time.
--
Michael Mior
mmior@apache.org

On Sat, Jan 4, 2020 at 15:10, Vladimir Sitnikov
<si...@gmail.com> wrote:
>
> Technically speaking, single-block read time for HDDs is pretty much
> stable, so the use of seconds might not be that bad.
> However, seconds might be complicated for measuring CPU-like activity (e.g.
> different machines might execute EnumerableJoin at different rates :( )
>
>
> What if we benchmark a trivial EnumerableCalc(EnumerableTableScan) for a
> table of 100 rows and 10 columns
> and call it a single cost unit?
>
> In other words, we could have an etalon benchmark that takes X seconds and
> we could call it a single cost unit.
>
> For instance, org.apache.calcite.rel.core.Sort#computeSelfCost returns a
> cost.
> Of course, it has an N*log(N) assumption, but which multiplier should it use?
>
> One could measure the wallclock time for the sort, and divide it by the
> time it takes to execute the etalon cost benchmark.
>
> WDYT?
>
> Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Vladimir Sitnikov <si...@gmail.com>.
Technically speaking, single-block read time for HDDs is pretty much
stable, so the use of seconds might not be that bad.
However, seconds might be complicated for measuring CPU-like activity (e.g.
different machines might execute EnumerableJoin at different rates :( )


What if we benchmark a trivial EnumerableCalc(EnumerableTableScan) for a
table of 100 rows and 10 columns
and call it a single cost unit?

In other words, we could have an etalon benchmark that takes X seconds and
we could call it a single cost unit.

For instance, org.apache.calcite.rel.core.Sort#computeSelfCost returns a
cost.
Of course, it has an N*log(N) assumption, but which multiplier should it use?

One could measure the wallclock time for the sort, and divide it by the
time it takes to execute the etalon cost benchmark.
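
Roughly, the calibration could look like this (a sketch only; the names and
the log2 comparison model are assumptions):

  // Derive a per-comparison multiplier from two wallclock measurements
  // taken on the same machine.
  //   tEtalonSeconds - time of the etalon benchmark
  //                    (EnumerableCalc over a 100x10 table)
  //   tSortSeconds   - time of sorting rowCount rows
  static double sortMultiplierInCostUnits(double tSortSeconds,
      double tEtalonSeconds, double rowCount) {
    double comparisons =
        rowCount * Math.max(Math.log(rowCount) / Math.log(2), 1);
    // How many "etalon runs" a single comparison is worth.
    return (tSortSeconds / tEtalonSeconds) / comparisons;
  }

  // Sort#computeSelfCost could then use
  //   cpu = multiplier * rowCount * log2(rowCount)
  // so the result is expressed in the same cost unit as other operators.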

WDYT?

Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Michael Mior <mm...@apache.org>.
I understand the cost doesn't have to match the actual execution duration,
and it doesn't really matter whether it does, as long as we can get the
relative ordering of plans roughly right. That's why I'm suggesting
not calling the cost "seconds", even if we are trying to roughly
approximate them. But I don't feel that strongly about this.
--
Michael Mior
mmior@apache.org

On Sat, Jan 4, 2020 at 14:18, Vladimir Sitnikov
<si...@gmail.com> wrote:
>
> Michael>although I would be hesitant to refer to "seconds"
>
> Do you have better ideas?
> If my memory serves me well, PostgreSQL uses seconds as well for cost units.
> OracleDB is using "singleblock read" for the cost unit.
>
> Michael>how long execution will take on any particular system
>
> The idea for the cost is to be able to compare different plans.
> The cost does not have to match the actual execution duration.
>
> Then there might be tuning knobs like "how many seconds a single I/O lasts".
>
> Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Vladimir Sitnikov <si...@gmail.com>.
Michael>although I would be hesitant to refer to "seconds"

Do you have better ideas?
If my memory serves me well, PostgreSQL uses seconds as well for cost units.
OracleDB is using "singleblock read" for the cost unit.

Michael>how long execution will take on any particular system

The idea for the cost is to be able to compare different plans.
The cost does not have to match the actual execution duration.

Then there might be tuning knobs like "how many seconds a single I/O lasts".
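
For example, such knobs might look like this (names invented for
illustration; PostgreSQL exposes comparable parameters such as
seq_page_cost and cpu_tuple_cost):

  // Invented names, just to show the shape of the tuning knobs.
  public class CostUnitSettings {
    /** Assumed duration of a single I/O, in seconds. */
    public double secondsPerIo = 0.001;
    /** Assumed duration of one unit of CPU work, in seconds. */
    public double secondsPerCpu = 0.000001;

    /** Collapses a (cpu, io) cost into an estimated duration in seconds. */
    public double toSeconds(double cpu, double io) {
      return cpu * secondsPerCpu + io * secondsPerIo;
    }
  }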

Vladimir

Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

Posted by Michael Mior <mm...@apache.org>.
A cost unit sounds fine to me, although I would be hesitant to refer
to "seconds" or other concrete measurements since there's no easy way
to guess how long execution will take on any particular system.
--
Michael Mior
mmior@apache.org

On Sat, Jan 4, 2020 at 10:56, Vladimir Sitnikov
<si...@gmail.com> wrote:
>
> Hi,
>
> I've spent some time on stabilizing the costs (see
> https://github.com/apache/calcite/pull/1702/commits ), and it looks like we
> might want to have some notion of "cost unit".
>
> For instance, we want to express that sorting table with 2 int columns is
> cheaper than sorting table with 22 int columns.
> Unfortunately, VolcanoCost is compared by the rows field only, so, for now, we
> fold the number of fields into the cost#rows field by adding something
> like "cost += fieldCount * 0.1" :(
>
> Of course, you might say that cost is pluggable, however, I would like to
> make the default implementation sane.
> At least it should be good enough for inspirational purposes.
>
> What do you think about adding a way to convert Cost to a double?
> For instance, we could add measurer.measure(cost) that would weigh the cost,
> or we could add a method like `double cost#toSeconds()`.
>
> I guess, if we add a unit (e.g. microsecond), then we could even
> micro-benchmark different join implementations, and use the appropriate
> cost values
> for extra columns and so on.
>
> I fully understand that the cost does not have to be precise; however, it
> is sad to guesstimate the multipliers for an extra field in a projection.
>
>
>
> Just to recap:
> 1) I've started with making tests parallel <-- this was the main goal
> 2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition
> which was using static variables
> 3) As I fixed EnumerableMergeJoinRule, it turned out the optimizer started
> to use merge join all over the place
> 4) It was caused by inappropriate costing of Sort, which I fixed
> 5) Then I updated the cost functions of EnumerableHashJoin and
> EnumerableNestedLoopJoin, and it was not enough because ReflectiveSchema
> was not providing the proper statistics
> 6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
> rowcount metadata.
>
> All of the above seems to be more-or-less stable (the plans improved!), and
> the failing tests, for now, are in MaterializationTest.
>
> The problem with those tests is that the cost differences between NestedLoop
> and HashJoin are tiny.
> For instance:
> testJoinMaterialization8
>
> EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost =
> {105.19999999999999 rows, 82.6 cpu, 0.0 io}, id = 780
>   EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount =
> 6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
>     EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount =
> 22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
>       EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative
> cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>     EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]):
> rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
>       EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0,
> cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>
> vs
>
> EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost =
> {81.19999999999999 rows, 55.6 cpu, 0.0 io}, id = 778
>   EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR,
> CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost =
> {74.6 rows, 49.0 cpu, 0.0 io}, id = 777
>     EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0,
> cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>     EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost
> = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>
> The second plan looks cheaper to the optimizer; however, the key difference
> comes from the three projects in the first plan (the projects account for
> 6.6+22+2=30.6 cost).
> If I increase the hr.dependents table to 3 rows, then the hash-based plan
> becomes cheaper.
>
> To me, both plans look acceptable; however, it is sad to analyze/debug
> those differences without being able to tell whether that is a plan
> degradation or whether it is acceptable.
>
> Vladimir