Posted to dev@calcite.apache.org by JiaTao Tao <ta...@gmail.com> on 2020/02/04 15:30:26 UTC

Is CBO really useful (you cannot estimate cost very well)?

In big data, does CBO really have such a big effect?
For nodes like filter/join/aggregate, the cost is only an estimate.

There is one case, which I call runtime optimizing: optimizing while
calculating. You adjust your execution plan in real time based on the
execution statistics of the previous step (e.g. join strategy selection:
hash, broadcast, or SMJ), but that is not CBO as in the Volcano planner.
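
(To make what I mean concrete, here is a minimal Java sketch of the kind of
runtime decision I'm talking about; the class, the strategies and the
threshold are hypothetical, not taken from any particular engine:)

  // Hypothetical sketch: choose the join strategy only after the previous
  // stage has finished, from its *observed* output size rather than an
  // optimizer estimate.
  enum JoinStrategy { BROADCAST_HASH, SHUFFLE_HASH, SORT_MERGE }

  final class RuntimeJoinChooser {
    private static final long BROADCAST_LIMIT_BYTES = 10L * 1024 * 1024;

    JoinStrategy choose(long observedBuildSideBytes, boolean buildSideSorted) {
      if (observedBuildSideBytes <= BROADCAST_LIMIT_BYTES) {
        return JoinStrategy.BROADCAST_HASH; // small enough to ship to every task
      }
      if (buildSideSorted) {
        return JoinStrategy.SORT_MERGE;     // reuse the existing ordering
      }
      return JoinStrategy.SHUFFLE_HASH;
    }
  }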

Regards!

Aron Tao

Re: Is CBO really useful (you cannot estimate cost very well)?

Posted by Rui Wang <am...@apache.org>.
From some real-world practice I have heard of (though we haven't tried it in
BeamSQL yet), CBO becomes less effective as plan complexity increases,
because for big data processing the estimates for the later stages are hard
to keep close to reality. Instead, the runtime optimization that you called
out does work. Basically, the idea is to use the previous stage's result to
estimate the next stage's stats, run that stage, and then use its running
result to estimate the stage after that. It's still CBO, but it never
estimates more than one layer ahead.
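
(A rough Java sketch of that one-layer-ahead loop; the Stage/Stats types
below are made up for illustration, they are not from Beam or Calcite:)

  import java.util.List;

  // Hypothetical sketch of "estimate no more than one layer ahead":
  // each stage is re-planned from the *measured* stats of the stage that
  // just ran, instead of estimates propagated through the whole plan.
  final class Stats {
    final long rowCount;
    final long bytes;
    Stats(long rowCount, long bytes) { this.rowCount = rowCount; this.bytes = bytes; }
  }

  interface Stage {
    Stage reoptimize(Stats measuredInputStats); // re-plan from real numbers
    Stats run(Stats inputStats);                // execute, return measured stats
  }

  final class StageByStageExecutor {
    Stats execute(List<Stage> stages, Stats sourceStats) {
      Stats current = sourceStats;
      for (Stage stage : stages) {
        Stage adjusted = stage.reoptimize(current); // cost-based, one layer deep
        current = adjusted.run(current);
      }
      return current;
    }
  }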


-Rui

Re: Is CBO really useful (you cannot estimate cost very well)?

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Aron,

When you say that the execution plan changes based on the stats of the
previous stage, for me that is still cost-based optimization.
As other people mentioned, cost-based optimization can be static (the plan
does not change after the initial optimization) or dynamic (the plan can
change at various points as additional input arrives).
Although I've not seen it in practice, you could even apply the Volcano
approach to sub-plans after having executed a portion of the query.
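
(If it helps, here is a very rough Java sketch of that idea; Plan, Fragment
and CostBasedPlanner are illustrative stand-ins, not Calcite's actual
planner classes:)

  import java.util.Map;

  // Hypothetical sketch: once a fragment of the query has executed, feed
  // its measured row count back in and let a Volcano-style (cost-based,
  // multi-alternative) search re-plan only the remaining sub-plan.
  record Plan(String description) {}
  record Fragment(String id, Plan plan) {}

  interface CostBasedPlanner {
    Plan optimize(Plan remainingLogicalPlan, Map<String, Long> knownRowCounts);
  }

  final class AdaptiveReplanner {
    private final CostBasedPlanner planner;

    AdaptiveReplanner(CostBasedPlanner planner) { this.planner = planner; }

    Plan replanRemainder(Fragment finished, long measuredRows,
                         Plan remainingLogicalPlan, Map<String, Long> rowCounts) {
      rowCounts.put(finished.id(), measuredRows); // real numbers, not estimates
      return planner.optimize(remainingLogicalPlan, rowCounts);
    }
  }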

Now regarding the OOM, I think it's only a problem of in-memory processing.
You could have operators that write intermediate results to disk (all
traditional DBMSs do this) instead of keeping everything in memory, and in
that case you don't have to worry about that.
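
(As a toy illustration of the spilling point, a sketch of a buffer that
overflows to disk once it exceeds a memory budget; again purely
illustrative, not how any particular engine implements it:)

  import java.io.BufferedWriter;
  import java.io.Closeable;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical sketch: a bad cardinality estimate degrades into slower
  // I/O instead of an OutOfMemoryError, because rows beyond the budget
  // are written to a temporary file.
  final class SpillingBuffer implements Closeable {
    private final int maxInMemoryRows;
    private final List<String> inMemory = new ArrayList<>();
    private final Path spillFile;
    private BufferedWriter spillWriter;

    SpillingBuffer(int maxInMemoryRows) throws IOException {
      this.maxInMemoryRows = maxInMemoryRows;
      this.spillFile = Files.createTempFile("spill", ".tmp");
    }

    void add(String row) throws IOException {
      if (inMemory.size() < maxInMemoryRows) {
        inMemory.add(row);
        return;
      }
      if (spillWriter == null) {
        spillWriter = Files.newBufferedWriter(spillFile);
      }
      spillWriter.write(row); // overflow goes to disk, not the heap
      spillWriter.newLine();
    }

    @Override public void close() throws IOException {
      if (spillWriter != null) {
        spillWriter.close();
      }
      Files.deleteIfExists(spillFile);
    }
  }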

Speaking of dynamic optimization techniques, another recent paper comes to
my mind [1], where plan execution switches from interpretation to
compilation. The technique is quite popular in programming languages (Java
is what comes to mind first) but less common in DBMSs.

Best,
Stamatis

[1] https://db.in.tum.de/~leis/papers/adaptiveexecution.pdf

Re: Is CBO really useful (you cannot estimate cost very well)?

Posted by JiaTao Tao <ta...@apache.org>.
Hi Rui, Julian,
Thanks for your replies.
Volcano/Cascades is great (rules do not affect each other). The question is
whether the estimated cost really helps, particularly in big data.
The estimates may be inaccurate. Take a plan like JOIN-FILTER-TABLESCAN: we
estimate that the filter removes most rows and choose a broadcast join, but
if the real filter rate is low, that can lead to an OOM.
We can do specific optimizations like join reordering (maybe the only one)
with CBO, do other optimizations with RBO, and leave join/aggregate
implementation selection to the runtime. I think rules like "pushing that
filter/aggregation through that join" are heuristics and can be done by RBO.
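
(Here is a small Java sketch of that failure mode and the runtime guard I
have in mind; the class and the threshold are hypothetical:)

  // Hypothetical sketch: the planner picks broadcast from an *estimated*
  // post-filter row count, but a runtime check abandons broadcast when the
  // real row count blows past the limit, instead of OOMing while building
  // the hash table on every task.
  final class BroadcastGuard {
    private final long broadcastRowLimit;

    BroadcastGuard(long broadcastRowLimit) { this.broadcastRowLimit = broadcastRowLimit; }

    boolean planChoosesBroadcast(double estimatedRowsAfterFilter) {
      return estimatedRowsAfterFilter <= broadcastRowLimit; // CBO decision, estimate only
    }

    boolean keepBroadcastAtRuntime(long actualRowsAfterFilter) {
      return actualRowsAfterFilter <= broadcastRowLimit;    // checked against real data
    }
  }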


Regards!

Aron Tao


Re: Is CBO really useful (you cannot estimate cost very well)?

Posted by Julian Hyde <jh...@apache.org>.
Runtime optimization is important too. Examples include the DB’s buffer cache (predicting future disk accesses based on previous accesses using e.g. LRU2) and self-tuning algorithms such as hybrid hash join (which figures out what key to use to partition, and which side of a join to build, based on data it sees while it is running).

It’s not either/or. You have to have both planning time and runtime optimization.

Stats need to get a lot better. Especially if you are optimizing N-way joins for N > 5. I’m a big fan of the “How good are optimizers, really?” paper [1] and the join-order benchmark. And approaches such as materializing intermediate result sets and re-planning.

But even if your stats are not great, CBO is an organizing principle for an optimizer. Is it worth pushing that filter through that join? Cost before and after, and see which is better. Is it still worth it if I push down the aggregation? Again, run the costs. Without an organizing principle, every optimization (transform rule) has to be aware of every other rule, and that leads to a software architecture that is too tightly coupled, and eventually it is not possible to add a new feature without breaking existing functionality.
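
(A minimal sketch of that organizing principle in Java; PlanNode, CostModel and Transform are illustrative stand-ins, not Calcite’s RelNode/RelOptCost API:)

  // Hypothetical sketch: a transform is kept only if the (possibly crude)
  // cost model says the rewritten plan is cheaper, so no rule needs to
  // know about any other rule.
  interface PlanNode {}
  interface CostModel { double cost(PlanNode plan); }
  interface Transform { PlanNode apply(PlanNode plan); }

  final class CostGuidedRewriter {
    private final CostModel costModel;

    CostGuidedRewriter(CostModel costModel) { this.costModel = costModel; }

    PlanNode tryTransform(PlanNode plan, Transform transform) {
      PlanNode rewritten = transform.apply(plan);
      return costModel.cost(rewritten) < costModel.cost(plan) ? rewritten : plan;
    }
  }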

My definition of a “cost-based optimizer” is one that is nondeterministic. By which I mean, it can investigate multiple paths, thereby avoiding local minima.

Therefore I am a fan of simple, crappy stats such as estimated row count.

Julian

[1] http://www.vldb.org/pvldb/vol9/p204-leis.pdf
