You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by "Gramana, Zachary (GE Digital)" <Za...@ge.com> on 2022/03/12 01:26:53 UTC

[Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Hello all!

I would appreciate some help/guidance with troubleshooting a big performance regression we saw on our UNION queries after updating our application from 1.24 to 1.29. Many of the tables in the queries are wide (> 50 columns) so we use a custom push-down Project class. These custom Projects are being eliminated in 1.29 UNION queries--though not when replacing UNION with UNION ALL.

The examples below illustrate minimal queries that reproduce the issue our test database.

When compared to the UNION query times using 1.24, the prepared UNION statements take 2-2.5 times longer in 1.29. The prepare time is around 2 times longer as well.

In the examples below, the UNION and UNION ALL queries generate the same result set in our test database. All Calcite tables have default collations ("ISO-8859-1$en_US"). Several are rather wide, with many > 50 columns.

In 1.29 for a simple query, both UNION and UNION ALL produce the same query plan, with the exception of EnumerableUnion's `all` parameter:

1.29
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION ALL
SELECT Id FROM t2
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[true])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t2]])

1.29
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT Id FROM t2
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t2]])

The next examples add a JOIN to one of the tables in order to make this a slightly more complicated query and trigger the issue. Unlike in 1.24, the plans in 1.29 for UNION and UNION ALL now vary significantly:

1.24 (like the above queries,  the UNION and UNION ALL plans differ only by the `all` parameter, so the UNION ALL plan for 1.24 is omitted)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyProject(Id=[$0])
            MyTableScan(table=[[MySchema, t3]])

1.29 (note this UNION ALL plan is nearly identical plan to the above `UNION` plan in 1.24)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION ALL
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[true])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyProject(Id=[$0])
            MyTableScan(table=[[MySchema, t3]])

1.29 (note the loss of the `MyProject` node above t3's MyTableScan, and now the inputs for the post-MergeJoin EnumerableCalc node are very wide compared to before)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..402=[{inputs}], Id=[$t264])
    EnumerableMergeJoin(condition=[=($263, $264)], joinType=[inner])
      EnumerableSort(sort0=[$263], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], proj#0..263=[{exprs}])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyTableScan(table=[[MySchema, t3]])

I've tried removing various rules and even restricting them, but the custom Projects are still removed in the EnumerableMergeJoin despite setting the self-cost to zero.

Am I missing something? Any suggestions on where to look next?

Thanks for any assistance/tips.

Best,
Zack

Re: [Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Posted by Ruben Q L <ru...@gmail.com>.

Hello,

At first glance, it could also be possible that the regression is
originated in the RelFieldTrimmer. Just a hypothesis.

Best,
Ruben



On Sat, Mar 12, 2022 at 6:35 AM Haisheng Yuan <hy...@apache.org> wrote:

> Hi Zack,
>
> Looks like it is a regression.
> Are you able to provide a reproducible test case? You can log a JIRA along
> with the test case, so people can do the root cause analysis.
>
> Thanks,
> Haisheng Yuan
>
> On 2022/03/12 01:26:53 "Gramana, Zachary (GE Digital)" wrote:
> > Hello all!
> >
> > I would appreciate some help/guidance with troubleshooting a big
> performance regression we saw on our UNION queries after updating our
> application from 1.24 to 1.29. Many of the tables in the queries are wide
> (> 50 columns) so we use a custom push-down Project class. These custom
> Projects are being eliminated in 1.29 UNION queries--though not when
> replacing UNION with UNION ALL.
> >
> > The examples below illustrate minimal queries that reproduce the issue
> our test database.
> >
> > When compared to the UNION query times using 1.24, the prepared UNION
> statements take 2-2.5 times longer in 1.29. The prepare time is around 2
> times longer as well.
> >
> > In the examples below, the UNION and UNION ALL queries generate the same
> result set in our test database. All Calcite tables have default collations
> ("ISO-8859-1$en_US"). Several are rather wide, with many > 50 columns.
> >
> > In 1.29 for a simple query, both UNION and UNION ALL produce the same
> query plan, with the exception of EnumerableUnion's `all` parameter:
> >
> > 1.29
> > EXPLAIN PLAN FOR
> > SELECT Id FROM t1
> > UNION ALL
> > SELECT Id FROM t2
> >
> -----------------------------------------------------------------------------------------------------
> > EnumerableUnion(all=[true])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t1]])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t2]])
> >
> > 1.29
> > EXPLAIN PLAN FOR
> > SELECT Id FROM t1
> > UNION
> > SELECT Id FROM t2
> >
> -----------------------------------------------------------------------------------------------------
> > EnumerableUnion(all=[false])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t1]])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t2]])
> >
> > The next examples add a JOIN to one of the tables in order to make this
> a slightly more complicated query and trigger the issue. Unlike in 1.24,
> the plans in 1.29 for UNION and UNION ALL now vary significantly:
> >
> > 1.24 (like the above queries,  the UNION and UNION ALL plans differ only
> by the `all` parameter, so the UNION ALL plan for 1.24 is omitted)
> > EXPLAIN PLAN FOR
> > SELECT Id FROM t1
> > UNION
> > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> >
> -----------------------------------------------------------------------------------------------------
> > EnumerableUnion(all=[false])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t1]])
> >   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
> >     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
> >       EnumerableSort(sort0=[$0], dir0=[ASC])
> >         EnumerableCalc(expr#0..262=[{inputs}],
> expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
> >           MyEnumerableConverter
> >             MyTableScan(table=[[MySchema, t2]])
> >       EnumerableSort(sort0=[$0], dir0=[ASC])
> >         MyEnumerableConverter
> >           MyProject(Id=[$0])
> >             MyTableScan(table=[[MySchema, t3]])
> >
> > 1.29 (note this UNION ALL plan is nearly identical plan to the above
> `UNION` plan in 1.24)
> > EXPLAIN PLAN FOR
> > SELECT Id FROM t1
> > UNION ALL
> > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> >
> -----------------------------------------------------------------------------------------------------
> > EnumerableUnion(all=[true])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t1]])
> >   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
> >     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
> >       EnumerableSort(sort0=[$0], dir0=[ASC])
> >         EnumerableCalc(expr#0..262=[{inputs}],
> expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
> >           MyEnumerableConverter
> >             MyTableScan(table=[[MySchema, t2]])
> >       EnumerableSort(sort0=[$0], dir0=[ASC])
> >         MyEnumerableConverter
> >           MyProject(Id=[$0])
> >             MyTableScan(table=[[MySchema, t3]])
> >
> > 1.29 (note the loss of the `MyProject` node above t3's MyTableScan, and
> now the inputs for the post-MergeJoin EnumerableCalc node are very wide
> compared to before)
> > EXPLAIN PLAN FOR
> > SELECT Id FROM t1
> > UNION
> > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> >
> -----------------------------------------------------------------------------------------------------
> > EnumerableUnion(all=[false])
> >   MyEnumerableConverter
> >     MyProject(Id=[$0])
> >       MyTableScan(table=[[MySchema, t1]])
> >   EnumerableCalc(expr#0..402=[{inputs}], Id=[$t264])
> >     EnumerableMergeJoin(condition=[=($263, $264)], joinType=[inner])
> >       EnumerableSort(sort0=[$263], dir0=[ASC])
> >         EnumerableCalc(expr#0..262=[{inputs}],
> expr#263=[CAST($t178):BIGINT NOT NULL], proj#0..263=[{exprs}])
> >           MyEnumerableConverter
> >             MyTableScan(table=[[MySchema, t2]])
> >       EnumerableSort(sort0=[$0], dir0=[ASC])
> >         MyEnumerableConverter
> >           MyTableScan(table=[[MySchema, t3]])
> >
> > I've tried removing various rules and even restricting them, but the
> custom Projects are still removed in the EnumerableMergeJoin despite
> setting the self-cost to zero.
> >
> > Am I missing something? Any suggestions on where to look next?
> >
> > Thanks for any assistance/tips.
> >
> > Best,
> > Zack
> >
> >
>

Re: [Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Posted by Haisheng Yuan <hy...@apache.org>.

Hi Zack,

Looks like it is a regression.
Are you able to provide a reproducible test case? You can log a JIRA along with the test case, so people can do the root cause analysis.

Thanks,
Haisheng Yuan

On 2022/03/12 01:26:53 "Gramana, Zachary (GE Digital)" wrote:
> Hello all!
> 
> I would appreciate some help/guidance with troubleshooting a big performance regression we saw on our UNION queries after updating our application from 1.24 to 1.29. Many of the tables in the queries are wide (> 50 columns) so we use a custom push-down Project class. These custom Projects are being eliminated in 1.29 UNION queries--though not when replacing UNION with UNION ALL.
> 
> The examples below illustrate minimal queries that reproduce the issue our test database.
> 
> When compared to the UNION query times using 1.24, the prepared UNION statements take 2-2.5 times longer in 1.29. The prepare time is around 2 times longer as well.
> 
> In the examples below, the UNION and UNION ALL queries generate the same result set in our test database. All Calcite tables have default collations ("ISO-8859-1$en_US"). Several are rather wide, with many > 50 columns.
> 
> In 1.29 for a simple query, both UNION and UNION ALL produce the same query plan, with the exception of EnumerableUnion's `all` parameter:
> 
> 1.29
> EXPLAIN PLAN FOR
> SELECT Id FROM t1
> UNION ALL
> SELECT Id FROM t2
> -----------------------------------------------------------------------------------------------------
> EnumerableUnion(all=[true])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t1]])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t2]])
> 
> 1.29
> EXPLAIN PLAN FOR
> SELECT Id FROM t1
> UNION
> SELECT Id FROM t2
> -----------------------------------------------------------------------------------------------------
> EnumerableUnion(all=[false])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t1]])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t2]])
> 
> The next examples add a JOIN to one of the tables in order to make this a slightly more complicated query and trigger the issue. Unlike in 1.24, the plans in 1.29 for UNION and UNION ALL now vary significantly:
> 
> 1.24 (like the above queries,  the UNION and UNION ALL plans differ only by the `all` parameter, so the UNION ALL plan for 1.24 is omitted)
> EXPLAIN PLAN FOR
> SELECT Id FROM t1
> UNION
> SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> -----------------------------------------------------------------------------------------------------
> EnumerableUnion(all=[false])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t1]])
>   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
>     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
>       EnumerableSort(sort0=[$0], dir0=[ASC])
>         EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
>           MyEnumerableConverter
>             MyTableScan(table=[[MySchema, t2]])
>       EnumerableSort(sort0=[$0], dir0=[ASC])
>         MyEnumerableConverter
>           MyProject(Id=[$0])
>             MyTableScan(table=[[MySchema, t3]])
> 
> 1.29 (note this UNION ALL plan is nearly identical plan to the above `UNION` plan in 1.24)
> EXPLAIN PLAN FOR
> SELECT Id FROM t1
> UNION ALL
> SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> -----------------------------------------------------------------------------------------------------
> EnumerableUnion(all=[true])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t1]])
>   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
>     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
>       EnumerableSort(sort0=[$0], dir0=[ASC])
>         EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
>           MyEnumerableConverter
>             MyTableScan(table=[[MySchema, t2]])
>       EnumerableSort(sort0=[$0], dir0=[ASC])
>         MyEnumerableConverter
>           MyProject(Id=[$0])
>             MyTableScan(table=[[MySchema, t3]])
> 
> 1.29 (note the loss of the `MyProject` node above t3's MyTableScan, and now the inputs for the post-MergeJoin EnumerableCalc node are very wide compared to before)
> EXPLAIN PLAN FOR
> SELECT Id FROM t1
> UNION
> SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
> -----------------------------------------------------------------------------------------------------
> EnumerableUnion(all=[false])
>   MyEnumerableConverter
>     MyProject(Id=[$0])
>       MyTableScan(table=[[MySchema, t1]])
>   EnumerableCalc(expr#0..402=[{inputs}], Id=[$t264])
>     EnumerableMergeJoin(condition=[=($263, $264)], joinType=[inner])
>       EnumerableSort(sort0=[$263], dir0=[ASC])
>         EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT NULL], proj#0..263=[{exprs}])
>           MyEnumerableConverter
>             MyTableScan(table=[[MySchema, t2]])
>       EnumerableSort(sort0=[$0], dir0=[ASC])
>         MyEnumerableConverter
>           MyTableScan(table=[[MySchema, t3]])
> 
> I've tried removing various rules and even restricting them, but the custom Projects are still removed in the EnumerableMergeJoin despite setting the self-cost to zero.
> 
> Am I missing something? Any suggestions on where to look next?
> 
> Thanks for any assistance/tips.
> 
> Best,
> Zack
> 
>