Posted to dev@arrow.apache.org by Andrew Lamb <al...@influxdata.com> on 2021/09/10 18:45:07 UTC

[Rust] [DataFusion] Move Statistics and Cost Based Optimizations to physical plans

I would like to draw some attention to a PR [1] that proposes to move
statistics (and the cost based optimizations that rely on them) into the
physical planning realm.

The feedback so far on this change seems positive but since it may have
non-trivial architectural impact on downstream projects that use
statistics, I wanted to give it some extra visibility.

Thanks,
Andrew

[1] https://github.com/apache/arrow-datafusion/pull/965

Re: [Rust] [DataFusion] Move Statistics and Cost Based Optimizations to physical plans

Posted by Rémi Dettai <rd...@gmail.com>.
Thanks Andrew for bringing this PR forward. I would just like to give the
big picture that led to this modification.

We would like DataFusion to integrate more efficiently with table
formats. I recently wrote a design document to that end [1] that goes
through the various ways we could achieve this. Its conclusion is that the
best moment to query the table format for the list of data files and
statistics is during the conversion from logical plan to physical plan.
This is why one of the first steps is to move the statistics out of the
logical plan abstraction (because they are unknown at the moment it is
built) and bring them into the physical plan.
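
To make the idea concrete, here is a minimal sketch of resolving files and
statistics at the logical-to-physical conversion step. It is not the code
from the PR and not DataFusion's actual API: the types `Statistics`,
`LogicalPlan`, `PhysicalPlan`, `TableFormat`, `InMemoryCatalog` and the
function `create_physical_plan` are all hypothetical, simplified names
chosen for illustration.

```rust
use std::collections::HashMap;

/// Simplified statistics for a set of files (hypothetical type).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Logical plan: names the table, but carries no files and no statistics,
/// because they are not known yet when the plan is built.
#[derive(Debug, Clone)]
pub enum LogicalPlan {
    TableScan { table: String },
}

/// Physical plan: knows the concrete files and their statistics.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalPlan {
    FileScan { files: Vec<String>, statistics: Statistics },
}

/// Hypothetical stand-in for a table format that can list data files
/// and report statistics for a table.
pub trait TableFormat {
    fn list_files(&self, table: &str) -> Option<(Vec<String>, Statistics)>;
}

/// Toy implementation backed by a map, for illustration only.
pub struct InMemoryCatalog {
    pub tables: HashMap<String, (Vec<String>, Statistics)>,
}

impl TableFormat for InMemoryCatalog {
    fn list_files(&self, table: &str) -> Option<(Vec<String>, Statistics)> {
        self.tables.get(table).cloned()
    }
}

/// The table format is queried here, during the conversion from logical
/// plan to physical plan, and nowhere earlier.
pub fn create_physical_plan(
    logical: &LogicalPlan,
    catalog: &dyn TableFormat,
) -> Result<PhysicalPlan, String> {
    match logical {
        LogicalPlan::TableScan { table } => {
            let (files, statistics) = catalog
                .list_files(table)
                .ok_or_else(|| format!("unknown table: {}", table))?;
            Ok(PhysicalPlan::FileScan { files, statistics })
        }
    }
}
```

In this sketch the logical plan can be built without touching the table
format at all; the possibly expensive file listing happens once, when the
physical plan is created.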

Running optimizations on the physical plan comes with its fair share of
challenges. The physical plan is much more complex and can take many more
shapes than the logical plan. This will definitely be a big challenge when
adding new optimization rules and maintaining existing ones. But it also
unlocks great potential, in particular for adaptive query optimizations,
which will become fairly easy to implement thanks to this change.
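
As an illustration of a cost based rule that works directly on a physical
plan, the sketch below puts the smaller input of a hash join on the build
side, using row counts attached to the physical nodes. This is an
assumption-laden example, not the rule from the PR: the `PhysicalPlan`
enum, its `num_rows` method and the function `order_join_sides` are
hypothetical, and the sketch assumes inner joins and ignores the output
column order.

```rust
/// Hypothetical, heavily simplified physical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalPlan {
    Scan { name: String, num_rows: Option<usize> },
    HashJoin { build: Box<PhysicalPlan>, probe: Box<PhysicalPlan> },
}

impl PhysicalPlan {
    /// Row-count estimate, or `None` when it is unknown.
    pub fn num_rows(&self) -> Option<usize> {
        match self {
            PhysicalPlan::Scan { num_rows, .. } => *num_rows,
            // Join output size is not estimated in this sketch.
            PhysicalPlan::HashJoin { .. } => None,
        }
    }
}

/// Recursively rewrites the plan so that, when both row counts are known,
/// the smaller input of each hash join is the build side.
/// Assumes inner joins, where swapping the sides keeps the same rows.
pub fn order_join_sides(plan: PhysicalPlan) -> PhysicalPlan {
    match plan {
        PhysicalPlan::HashJoin { build, probe } => {
            // Optimize the children first.
            let build = Box::new(order_join_sides(*build));
            let probe = Box::new(order_join_sides(*probe));
            match (build.num_rows(), probe.num_rows()) {
                // Swap only when statistics show the build side is larger.
                (Some(b), Some(p)) if b > p => PhysicalPlan::HashJoin {
                    build: probe,
                    probe: build,
                },
                // Unknown statistics or already well ordered: keep as is.
                _ => PhysicalPlan::HashJoin { build, probe },
            }
        }
        scan => scan,
    }
}
```

Because the rule only reads statistics from physical nodes, the same
function could in principle be re-run with fresher row counts gathered
during execution, which is the kind of adaptive behavior mentioned above.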

Your feedback is very welcome, both on the PR and the design document.

Remi

[1]
https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ/edit?usp=sharing

On Fri, Sep 10, 2021 at 20:45, Andrew Lamb <al...@influxdata.com> wrote:

> I would like to draw some attention to a PR [1] that proposes to move
> statistics (and the cost based optimizations that rely on them) into the
> physical planning realm.
>
> The feedback so far on this change seems positive but since it may have
> non-trivial architectural impact on downstream projects that use
> statistics, I wanted to give it some extra visibility.
>
> Thanks,
> Andrew
>
> [1] https://github.com/apache/arrow-datafusion/pull/965
>