You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Vladimir Ozerov <pp...@gmail.com> on 2021/02/13 07:03:42 UTC

Add cross-join suppression to JoinAssociateRule

Hi,

Join order planning is an important optimization problem. The widely-used
heuristic is to consider all bushy trees without cross-joins. There is
proof [1] that a pair of commute and associate rules is a complete ruleset
to explore all bushy trees without the cross-joins if the initial join tree
has no cross-joins.

In Apache Calcite, we do not suppress cross-products. As a result, even
simple join topologies, like a chain, cannot be planned in a reasonable
time for more than 6-7 tables. How do you feel if we add an optional
cross-join suppression to the JoinAssociateRule, and possibly
JoinPushThroughJoinRule?

The cross-join suppression is not the only problem with the exhaustive join
planning in Apache Calcite. But at the very least, it is simple to
implement, and it extends the number of tables in a join that could be
planned exhaustively for some topologies by additional 2-5 tables.

Regards,
Vladimir.

[1] https://dl.acm.org/doi/10.14778/2732977.2732997

Re: Add cross-join suppression to JoinAssociateRule

Posted by Vladimir Ozerov <pp...@gmail.com>.

I created the PR [1]. It adds a property that prunes new join trees if
there is an "always true" condition. Strictly speaking, this is not a
suppression of all possible cross-joins. But such check is cheap and should
work well in practical optimizers assuming that a filter-pushdown is
applied, and joins contain only those conditions, that couldn't be
pushed down.

[1] https://github.com/apache/calcite/pull/2359


сб, 20 февр. 2021 г. в 20:34, Vladimir Ozerov <pp...@gmail.com>:

> Thank you for the feedback. I'll try to prototype it for all the affected
> rules.
>
> Regards,
> Vladimir
>
> чт, 18 февр. 2021 г. в 13:02, Stamatis Zampetakis <za...@gmail.com>:
>
>> Hi Vladimir,
>>
>> Thanks for bringing up this topic.
>>
>> There are various works who highlight the importance of cartesian products
>> for optimality and try to derive techniques for exploring the complete
>> search space efficiently.
>> Having said that indeed not considering cartesian products is a common &
>> popular heuristic for queries with many relations so I find it a good idea
>> to give to the users the option to use it or not.
>>
>> Best,
>> Stamatis
>>
>> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <pp...@gmail.com>
>> wrote:
>>
>> > Hi Julian,
>> >
>> > First of all, I want to clarify that if we decide to do the cross-join
>> > suppression, we would have to do that for JoinCommuteRule either. I
>> missed
>> > that in the original email.
>> >
>> > Now back to your comments. The proposal assumes that we start with a
>> > connected join graph without the cross-products and never generate other
>> > cross-products. This covers many common queries, but not all. First, as
>> you
>> > mentioned, some queries may potentially benefit from cross-products. I
>> do
>> > not have specific queries in mind, but I cannot prove that such queries
>> > don't exist either. Second, we may have cross-products in an otherwise
>> > connected join graph if the join condition is located in the WHERE
>> clause.
>> > For such queries, we may need to do some pre-processing, like filter
>> > pushdown. Therefore, I assume that real-world products would have to do
>> > some preliminary analysis and/or processing to decide whether it is
>> safe to
>> > use the cross-join suppression. This might not be very convenient for
>> new
>> > users, but IMO perfectly fine for production-grade systems because they
>> > typically do complex multi-phase optimization anyway. For example,
>> recall
>> > how VoltDB decides whether to perform the join planning in the physical
>> > phase depending on the number of joins in the query [1].
>> >
>> > Regarding cardinality estimations, the main problem is that the built-in
>> > Calcite mechanism is pretty imprecise because they use mainly row count
>> and
>> > magic numbers. For example, in TPC-H queries, the cardinality
>> estimations
>> > easily reach billions and billions of rows. And even if the system has
>> more
>> > advanced cardinality estimators, such as histograms, the estimation
>> errors
>> > are likely to build up quickly still [2]. That said, it could be
>> difficult
>> > to design a robust heuristic to suppress particular joins. The
>> advantage of
>> > a straightforward cross-join suppression heuristic is that it is a good
>> > choice for a good fraction of real-world queries.
>> >
>> > To clarify, the proposal is not to make the cross-join suppression the
>> > default behavior. Rather, we may add it as a configuration property to
>> join
>> > planning rules, that could be used by advanced users.
>> >
>> > Regards,
>> > Vladimir.
>> >
>> > [1]
>> >
>> >
>> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
>> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594
>> >
>> > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <jh...@gmail.com>:
>> >
>> > > Two separate points.
>> > >
>> > > 1. I recall that there is an important class of plans that first forms
>> > the
>> > > Cartesian product of some very small relations, and then joins
>> everything
>> > > else to it. Does anyone else encounter these plans? Would these plans
>> be
>> > > unavailable if we made your proposed change?
>> > >
>> > > 2. If I join on a low cardinality attribute (eg customers to
>> suppliers on
>> > > state) the result is almost as bad a cross join. I presume you would
>> want
>> > > to treat this join similarly. If so, it would make sense to have the
>> > rules
>> > > on selectivity (or similar) rather than structure alone.
>> > >
>> > > Julian
>> > >
>> > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <pp...@gmail.com>
>> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Join order planning is an important optimization problem. The
>> > widely-used
>> > > > heuristic is to consider all bushy trees without cross-joins. There
>> is
>> > > > proof [1] that a pair of commute and associate rules is a complete
>> > > ruleset
>> > > > to explore all bushy trees without the cross-joins if the initial
>> join
>> > > tree
>> > > > has no cross-joins.
>> > > >
>> > > > In Apache Calcite, we do not suppress cross-products. As a result,
>> even
>> > > > simple join topologies, like a chain, cannot be planned in a
>> reasonable
>> > > > time for more than 6-7 tables. How do you feel if we add an optional
>> > > > cross-join suppression to the JoinAssociateRule, and possibly
>> > > > JoinPushThroughJoinRule?
>> > > >
>> > > > The cross-join suppression is not the only problem with the
>> exhaustive
>> > > join
>> > > > planning in Apache Calcite. But at the very least, it is simple to
>> > > > implement, and it extends the number of tables in a join that could
>> be
>> > > > planned exhaustively for some topologies by additional 2-5 tables.
>> > > >
>> > > > Regards,
>> > > > Vladimir.
>> > > >
>> > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
>> > >
>> >
>>
>

Re: Add cross-join suppression to JoinAssociateRule

Posted by Vladimir Ozerov <pp...@gmail.com>.

Thank you for the feedback. I'll try to prototype it for all the affected
rules.

Regards,
Vladimir

чт, 18 февр. 2021 г. в 13:02, Stamatis Zampetakis <za...@gmail.com>:

> Hi Vladimir,
>
> Thanks for bringing up this topic.
>
> There are various works who highlight the importance of cartesian products
> for optimality and try to derive techniques for exploring the complete
> search space efficiently.
> Having said that indeed not considering cartesian products is a common &
> popular heuristic for queries with many relations so I find it a good idea
> to give to the users the option to use it or not.
>
> Best,
> Stamatis
>
> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <pp...@gmail.com>
> wrote:
>
> > Hi Julian,
> >
> > First of all, I want to clarify that if we decide to do the cross-join
> > suppression, we would have to do that for JoinCommuteRule either. I
> missed
> > that in the original email.
> >
> > Now back to your comments. The proposal assumes that we start with a
> > connected join graph without the cross-products and never generate other
> > cross-products. This covers many common queries, but not all. First, as
> you
> > mentioned, some queries may potentially benefit from cross-products. I do
> > not have specific queries in mind, but I cannot prove that such queries
> > don't exist either. Second, we may have cross-products in an otherwise
> > connected join graph if the join condition is located in the WHERE
> clause.
> > For such queries, we may need to do some pre-processing, like filter
> > pushdown. Therefore, I assume that real-world products would have to do
> > some preliminary analysis and/or processing to decide whether it is safe
> to
> > use the cross-join suppression. This might not be very convenient for new
> > users, but IMO perfectly fine for production-grade systems because they
> > typically do complex multi-phase optimization anyway. For example, recall
> > how VoltDB decides whether to perform the join planning in the physical
> > phase depending on the number of joins in the query [1].
> >
> > Regarding cardinality estimations, the main problem is that the built-in
> > Calcite mechanism is pretty imprecise because they use mainly row count
> and
> > magic numbers. For example, in TPC-H queries, the cardinality estimations
> > easily reach billions and billions of rows. And even if the system has
> more
> > advanced cardinality estimators, such as histograms, the estimation
> errors
> > are likely to build up quickly still [2]. That said, it could be
> difficult
> > to design a robust heuristic to suppress particular joins. The advantage
> of
> > a straightforward cross-join suppression heuristic is that it is a good
> > choice for a good fraction of real-world queries.
> >
> > To clarify, the proposal is not to make the cross-join suppression the
> > default behavior. Rather, we may add it as a configuration property to
> join
> > planning rules, that could be used by advanced users.
> >
> > Regards,
> > Vladimir.
> >
> > [1]
> >
> >
> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594
> >
> > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <jh...@gmail.com>:
> >
> > > Two separate points.
> > >
> > > 1. I recall that there is an important class of plans that first forms
> > the
> > > Cartesian product of some very small relations, and then joins
> everything
> > > else to it. Does anyone else encounter these plans? Would these plans
> be
> > > unavailable if we made your proposed change?
> > >
> > > 2. If I join on a low cardinality attribute (eg customers to suppliers
> on
> > > state) the result is almost as bad a cross join. I presume you would
> want
> > > to treat this join similarly. If so, it would make sense to have the
> > rules
> > > on selectivity (or similar) rather than structure alone.
> > >
> > > Julian
> > >
> > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <pp...@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Join order planning is an important optimization problem. The
> > widely-used
> > > > heuristic is to consider all bushy trees without cross-joins. There
> is
> > > > proof [1] that a pair of commute and associate rules is a complete
> > > ruleset
> > > > to explore all bushy trees without the cross-joins if the initial
> join
> > > tree
> > > > has no cross-joins.
> > > >
> > > > In Apache Calcite, we do not suppress cross-products. As a result,
> even
> > > > simple join topologies, like a chain, cannot be planned in a
> reasonable
> > > > time for more than 6-7 tables. How do you feel if we add an optional
> > > > cross-join suppression to the JoinAssociateRule, and possibly
> > > > JoinPushThroughJoinRule?
> > > >
> > > > The cross-join suppression is not the only problem with the
> exhaustive
> > > join
> > > > planning in Apache Calcite. But at the very least, it is simple to
> > > > implement, and it extends the number of tables in a join that could
> be
> > > > planned exhaustively for some topologies by additional 2-5 tables.
> > > >
> > > > Regards,
> > > > Vladimir.
> > > >
> > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
> > >
> >
>

Re: Add cross-join suppression to JoinAssociateRule

Posted by Stamatis Zampetakis <za...@gmail.com>.

Hi Vladimir,

Thanks for bringing up this topic.

There are various works who highlight the importance of cartesian products
for optimality and try to derive techniques for exploring the complete
search space efficiently.
Having said that indeed not considering cartesian products is a common &
popular heuristic for queries with many relations so I find it a good idea
to give to the users the option to use it or not.

Best,
Stamatis

On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <pp...@gmail.com> wrote:

> Hi Julian,
>
> First of all, I want to clarify that if we decide to do the cross-join
> suppression, we would have to do that for JoinCommuteRule either. I missed
> that in the original email.
>
> Now back to your comments. The proposal assumes that we start with a
> connected join graph without the cross-products and never generate other
> cross-products. This covers many common queries, but not all. First, as you
> mentioned, some queries may potentially benefit from cross-products. I do
> not have specific queries in mind, but I cannot prove that such queries
> don't exist either. Second, we may have cross-products in an otherwise
> connected join graph if the join condition is located in the WHERE clause.
> For such queries, we may need to do some pre-processing, like filter
> pushdown. Therefore, I assume that real-world products would have to do
> some preliminary analysis and/or processing to decide whether it is safe to
> use the cross-join suppression. This might not be very convenient for new
> users, but IMO perfectly fine for production-grade systems because they
> typically do complex multi-phase optimization anyway. For example, recall
> how VoltDB decides whether to perform the join planning in the physical
> phase depending on the number of joins in the query [1].
>
> Regarding cardinality estimations, the main problem is that the built-in
> Calcite mechanism is pretty imprecise because they use mainly row count and
> magic numbers. For example, in TPC-H queries, the cardinality estimations
> easily reach billions and billions of rows. And even if the system has more
> advanced cardinality estimators, such as histograms, the estimation errors
> are likely to build up quickly still [2]. That said, it could be difficult
> to design a robust heuristic to suppress particular joins. The advantage of
> a straightforward cross-join suppression heuristic is that it is a good
> choice for a good fraction of real-world queries.
>
> To clarify, the proposal is not to make the cross-join suppression the
> default behavior. Rather, we may add it as a configuration property to join
> planning rules, that could be used by advanced users.
>
> Regards,
> Vladimir.
>
> [1]
>
> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
> [2] https://dl.acm.org/doi/10.14778/2850583.2850594
>
> сб, 13 февр. 2021 г. в 23:05, Julian Hyde <jh...@gmail.com>:
>
> > Two separate points.
> >
> > 1. I recall that there is an important class of plans that first forms
> the
> > Cartesian product of some very small relations, and then joins everything
> > else to it. Does anyone else encounter these plans? Would these plans be
> > unavailable if we made your proposed change?
> >
> > 2. If I join on a low cardinality attribute (eg customers to suppliers on
> > state) the result is almost as bad a cross join. I presume you would want
> > to treat this join similarly. If so, it would make sense to have the
> rules
> > on selectivity (or similar) rather than structure alone.
> >
> > Julian
> >
> > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <pp...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Join order planning is an important optimization problem. The
> widely-used
> > > heuristic is to consider all bushy trees without cross-joins. There is
> > > proof [1] that a pair of commute and associate rules is a complete
> > ruleset
> > > to explore all bushy trees without the cross-joins if the initial join
> > tree
> > > has no cross-joins.
> > >
> > > In Apache Calcite, we do not suppress cross-products. As a result, even
> > > simple join topologies, like a chain, cannot be planned in a reasonable
> > > time for more than 6-7 tables. How do you feel if we add an optional
> > > cross-join suppression to the JoinAssociateRule, and possibly
> > > JoinPushThroughJoinRule?
> > >
> > > The cross-join suppression is not the only problem with the exhaustive
> > join
> > > planning in Apache Calcite. But at the very least, it is simple to
> > > implement, and it extends the number of tables in a join that could be
> > > planned exhaustively for some topologies by additional 2-5 tables.
> > >
> > > Regards,
> > > Vladimir.
> > >
> > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
> >
>

Re: Add cross-join suppression to JoinAssociateRule

Posted by Vladimir Ozerov <pp...@gmail.com>.

Hi Julian,

First of all, I want to clarify that if we decide to do the cross-join
suppression, we would have to do that for JoinCommuteRule either. I missed
that in the original email.

Now back to your comments. The proposal assumes that we start with a
connected join graph without the cross-products and never generate other
cross-products. This covers many common queries, but not all. First, as you
mentioned, some queries may potentially benefit from cross-products. I do
not have specific queries in mind, but I cannot prove that such queries
don't exist either. Second, we may have cross-products in an otherwise
connected join graph if the join condition is located in the WHERE clause.
For such queries, we may need to do some pre-processing, like filter
pushdown. Therefore, I assume that real-world products would have to do
some preliminary analysis and/or processing to decide whether it is safe to
use the cross-join suppression. This might not be very convenient for new
users, but IMO perfectly fine for production-grade systems because they
typically do complex multi-phase optimization anyway. For example, recall
how VoltDB decides whether to perform the join planning in the physical
phase depending on the number of joins in the query [1].

Regarding cardinality estimations, the main problem is that the built-in
Calcite mechanism is pretty imprecise because they use mainly row count and
magic numbers. For example, in TPC-H queries, the cardinality estimations
easily reach billions and billions of rows. And even if the system has more
advanced cardinality estimators, such as histograms, the estimation errors
are likely to build up quickly still [2]. That said, it could be difficult
to design a robust heuristic to suppress particular joins. The advantage of
a straightforward cross-join suppression heuristic is that it is a good
choice for a good fraction of real-world queries.

To clarify, the proposal is not to make the cross-join suppression the
default behavior. Rather, we may add it as a configuration property to join
planning rules, that could be used by advanced users.

Regards,
Vladimir.

[1]
https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
[2] https://dl.acm.org/doi/10.14778/2850583.2850594

сб, 13 февр. 2021 г. в 23:05, Julian Hyde <jh...@gmail.com>:

> Two separate points.
>
> 1. I recall that there is an important class of plans that first forms the
> Cartesian product of some very small relations, and then joins everything
> else to it. Does anyone else encounter these plans? Would these plans be
> unavailable if we made your proposed change?
>
> 2. If I join on a low cardinality attribute (eg customers to suppliers on
> state) the result is almost as bad a cross join. I presume you would want
> to treat this join similarly. If so, it would make sense to have the rules
> on selectivity (or similar) rather than structure alone.
>
> Julian
>
> > On Feb 12, 2021, at 23:04, Vladimir Ozerov <pp...@gmail.com> wrote:
> >
> > Hi,
> >
> > Join order planning is an important optimization problem. The widely-used
> > heuristic is to consider all bushy trees without cross-joins. There is
> > proof [1] that a pair of commute and associate rules is a complete
> ruleset
> > to explore all bushy trees without the cross-joins if the initial join
> tree
> > has no cross-joins.
> >
> > In Apache Calcite, we do not suppress cross-products. As a result, even
> > simple join topologies, like a chain, cannot be planned in a reasonable
> > time for more than 6-7 tables. How do you feel if we add an optional
> > cross-join suppression to the JoinAssociateRule, and possibly
> > JoinPushThroughJoinRule?
> >
> > The cross-join suppression is not the only problem with the exhaustive
> join
> > planning in Apache Calcite. But at the very least, it is simple to
> > implement, and it extends the number of tables in a join that could be
> > planned exhaustively for some topologies by additional 2-5 tables.
> >
> > Regards,
> > Vladimir.
> >
> > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
>

Re: Add cross-join suppression to JoinAssociateRule

Posted by Julian Hyde <jh...@gmail.com>.

Two separate points.

1. I recall that there is an important class of plans that first forms the Cartesian product of some very small relations, and then joins everything else to it. Does anyone else encounter these plans? Would these plans be unavailable if we made your proposed change?

2. If I join on a low cardinality attribute (eg customers to suppliers on state) the result is almost as bad a cross join. I presume you would want to treat this join similarly. If so, it would make sense to have the rules on selectivity (or similar) rather than structure alone.

Julian

> On Feb 12, 2021, at 23:04, Vladimir Ozerov <pp...@gmail.com> wrote:
> 
> Hi,
> 
> Join order planning is an important optimization problem. The widely-used
> heuristic is to consider all bushy trees without cross-joins. There is
> proof [1] that a pair of commute and associate rules is a complete ruleset
> to explore all bushy trees without the cross-joins if the initial join tree
> has no cross-joins.
> 
> In Apache Calcite, we do not suppress cross-products. As a result, even
> simple join topologies, like a chain, cannot be planned in a reasonable
> time for more than 6-7 tables. How do you feel if we add an optional
> cross-join suppression to the JoinAssociateRule, and possibly
> JoinPushThroughJoinRule?
> 
> The cross-join suppression is not the only problem with the exhaustive join
> planning in Apache Calcite. But at the very least, it is simple to
> implement, and it extends the number of tables in a join that could be
> planned exhaustively for some topologies by additional 2-5 tables.
> 
> Regards,
> Vladimir.
> 
> [1] https://dl.acm.org/doi/10.14778/2732977.2732997