Posted to dev@calcite.apache.org by Thomas Rebele <tr...@tibco.com.INVALID> on 2020/08/07 13:56:51 UTC

Gather real world statistics about row count, CPU, and IO cost

Hi all,

I'm working on basic query optimization. I once stumbled on a case where
two operators had the same row count but one had a much higher CPU cost.
Unfortunately, the default cost model only takes the row count into account
(see [1]). Stamatis pointed out in another mail that the row count might
matter much more than the other costs [2]. However, if there are two
possible choices with the same row count, we should prefer the one with the
lower CPU cost. I'm also wondering whether the assumption that a smaller
row count is better in most cases is actually correct. And what does
"better" mean in this context? The query plan with the shortest execution
time? Maybe there's a plan that is just <10% slower but consumes much less
CPU/memory/etc.
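
To make the idea concrete, here is a small sketch of the comparison I have
in mind (this is not the actual VolcanoCost code, just an illustration of
a tie-break on CPU; the class name is made up):

import org.apache.calcite.plan.RelOptCost;

/** Hypothetical comparison, not Calcite's VolcanoCost: prefer the plan
 * with fewer rows, and only when the row counts are equal fall back to
 * the CPU cost. */
class CpuTieBreakCost {
  static boolean isLessThan(RelOptCost c1, RelOptCost c2) {
    if (c1.getRows() != c2.getRows()) {
      return c1.getRows() < c2.getRows();
    }
    return c1.getCpu() < c2.getCpu();
  }
}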

So I thought about the cost model in general and how to improve it. I
assume that the more closely the estimated cost corresponds to the real
cost, the better the optimized plans will be. The first step would
therefore be to collect real-world statistics, and the second step to
adapt the cost estimation so that it corresponds better to them. To begin
with, I would just measure, for each RelNode, how many rows ended up in
its result and how much time passed during query execution. Is there
already a way to do this in Calcite? Does this make sense at all?
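
I don't know yet where such a hook would live in Calcite, so the following
is only a sketch of the kind of measurement I mean: a hypothetical wrapper
around the Enumerable of one RelNode that counts the rows flowing through
it and the time spent pulling them. Note that the time measured this way is
cumulative, i.e. it includes the time spent in the inputs of the RelNode.

import org.apache.calcite.linq4j.AbstractEnumerable;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Enumerator;

/** Hypothetical instrumentation: wraps the Enumerable produced for one
 * RelNode and records how many rows were pulled and how long that took. */
class MeasuringEnumerable<T> extends AbstractEnumerable<T> {
  private final Enumerable<T> delegate;
  private final String relDescription; // e.g. the RelNode's digest

  MeasuringEnumerable(Enumerable<T> delegate, String relDescription) {
    this.delegate = delegate;
    this.relDescription = relDescription;
  }

  @Override public Enumerator<T> enumerator() {
    final Enumerator<T> e = delegate.enumerator();
    return new Enumerator<T>() {
      long rows;
      long nanos;

      @Override public T current() {
        return e.current();
      }

      @Override public boolean moveNext() {
        long start = System.nanoTime();
        boolean hasNext = e.moveNext();
        nanos += System.nanoTime() - start;
        if (hasNext) {
          rows++;
        }
        return hasNext;
      }

      @Override public void reset() {
        e.reset();
      }

      @Override public void close() {
        e.close();
        // Where these numbers should end up is exactly my question.
        System.out.printf("%s: %d rows, %.3f ms%n",
            relDescription, rows, nanos / 1e6);
      }
    };
  }
}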

[1]
https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
[2]
https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf

Cordialement / Best Regards,
*Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France
| www.tibco.com

Re: Gather real world statistics about row count, CPU, and IO cost

Posted by Stamatis Zampetakis <za...@gmail.com>.
Any help for improving the cost model is much appreciated.

I am not sure what you have in mind when you mention real-world statistics,
but I think you could also start with the stats provided by synthetic
datasets such as Foodmart, TPC-H, and TPC-DS.

Foodmart, TPC-H, and TPC-DS come with a rich set of queries (especially
TPC-DS) that you could use to demonstrate that the changes to the cost
model are beneficial. The advantage is that you can also run the queries
and validate the improvement.

Even if you want to focus on two-table joins, you could do that by taking
sub-patterns of the queries appearing in TPC-DS, for instance.

In general, apart from the cost model, the default rule set can also be
improved, and using the TPC-H/TPC-DS query plans as a baseline could help
a lot.

All the datasets mentioned previously, and more, are already present in
the repo, so you can rely on them to get started.

Best,
Stamatis


Re: Gather real world statistics about row count, CPU, and IO cost

Posted by Thomas Rebele <tr...@tibco.com.INVALID>.
Thank you for your analysis and the references. I would be more interested
in sub-problem 2. A kind of tuning tool would be nice: if a project using
Calcite introduces custom operators, it could help to find a cost model
for them.

I would start with a simple case, e.g., deciding whether a particular join
should be translated into a hash join, a merge join, or a correlate.
Depending on the cardinality of the inputs, one algorithm or the other
might be better. The real-world statistics might help to decide which one
to choose.
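
To illustrate the kind of decision I have in mind, here is a toy
comparison of the three algorithms; the constants are invented
placeholders and are exactly what the real-world statistics should
calibrate:

/** Toy cost formulas for picking a join algorithm. The C_* constants are
 * invented placeholders, to be replaced by calibrated values. */
class JoinChoice {
  static final double C_HASH_BUILD = 4.0; // insert one row into the hash table
  static final double C_HASH_PROBE = 1.0; // probe the hash table with one row
  static final double C_MERGE = 1.0;      // advance one row on sorted inputs
  static final double C_NESTED = 0.5;     // one comparison in the correlate

  static String choose(double leftRows, double rightRows) {
    double build = Math.min(leftRows, rightRows);
    double probe = Math.max(leftRows, rightRows);
    double hash = build * C_HASH_BUILD + probe * C_HASH_PROBE;
    double merge = (leftRows + rightRows) * C_MERGE; // assumes sorted inputs
    double nested = leftRows * rightRows * C_NESTED;
    if (hash <= merge && hash <= nested) {
      return "hash join";
    }
    return merge <= nested ? "merge join" : "correlate";
  }
}

With these made-up constants, choose(1_000_000, 10) picks the merge join;
whether that matches reality is exactly what the measurements should tell
us.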

Cordialement / Best Regards,
*Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France
| www.tibco.com


Re: Gather real world statistics about row count, CPU, and IO cost

Posted by Julian Hyde <jh...@apache.org>.
Also, check out the paper "Learning to Optimize Join Queries With Deep
Reinforcement Learning" by Krishnan, Yang, Goldberg, Hellerstein,
Stoica 2019 (which aims to improve the join-order benchmark and uses
Calcite as one of its test platforms):
https://arxiv.org/pdf/1808.03196.pdf

Re: Gather real world statistics about row count, CPU, and IO cost

Posted by Julian Hyde <jh...@apache.org>.
I consider there to be two fairly independent sub-problems:

1. Improve our cardinality estimates (especially for queries with
complex joins and aggregation).

2. Calibrate physical operators so that, given a good estimate of the
number of rows they will see, we can come up with a reasonable
estimate of the physical cost (e.g. how long the query will take to
execute, and how much memory).

For 1, the paper you cite and the join-order benchmark it introduces
are excellent contributions to the field. They inspired me to do work
on profiling [1]. I would encourage you to build on the work I have
already done.

For 2, I have not done any work personally. An approach would be to
give each physical operator (e.g. EnumerableHashJoin) a cost model
parameterized by certain constants, and then run experiments to
determine the values of those constants empirically. Perhaps we could
write a "TuningTool" that generates an "operator constants file", and
thereby start to formalize the process.
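
To sketch what I mean (none of the class names or constants below exist
in Calcite; they are made up for illustration): the operator's cost is a
function of a few observable quantities, parameterized by constants that
the tuning tool would fit to measured execution times.

import org.apache.calcite.plan.RelOptCost;
import org.apache.calcite.plan.RelOptCostFactory;

/** Made-up sketch of a parameterized cost model for a hash join. The
 * constants are what the "TuningTool" would determine empirically and
 * write to the "operator constants file". */
class HashJoinCostModel {
  double cpuPerBuildRow = 4.0;   // placeholder until calibrated
  double cpuPerProbeRow = 1.0;   // placeholder until calibrated
  double cpuPerOutputRow = 0.5;  // placeholder until calibrated

  RelOptCost cost(RelOptCostFactory factory,
      double buildRows, double probeRows, double outputRows) {
    double cpu = buildRows * cpuPerBuildRow
        + probeRows * cpuPerProbeRow
        + outputRows * cpuPerOutputRow;
    // The hash table is built in memory, so there is no IO term here.
    return factory.makeCost(outputRows, cpu, 0);
  }
}

The tuning tool would run EnumerableHashJoin on inputs of varying sizes,
measure the elapsed times, and fit the three constants (e.g. by least
squares); the fitted values would become the "operator constants file".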

Julian

[1] https://www.slideshare.net/julianhyde/data-profiling-in-apache-calcite
