You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by JiaTao Tao <ta...@gmail.com> on 2020/12/02 09:47:02 UTC

TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Hi fellows
My PR failed due to, seems an unstable ut, have anyone met this?

FAILURE 300.8sec, org.apache.calcite.adapter.tpch.TpchTest > testQuery07()
    java.util.concurrent.TimeoutException: testQuery07() timed out
after 5 minutes

https://github.com/apache/calcite/pull/2284

https://ci.appveyor.com/project/ApacheSoftwareFoundation/calcite/builds/36616339/job/hblxyoq0myx96d5u

Regards!

Aron Tao

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi
the flame graph is IDEA's feature, you can directly run with the flame
graph.
As I said 2222 does increase the time, but the time is almost from
getRelList, so if we can opt getRelList, the overhead is gone.

Regards!

Aron Tao


Vladimir Sitnikov <si...@gmail.com> 于2020年12月3日周四 下午9:02写道:

> >From the flame graph, the most time is wasted in "getRelList", I didn't
> see
>
> How do you collect and produce the flamegraphs?
> Do you use debugnonsafepoints Java option?
>
> Here are my measurements for running testQuery07:
>
> with 2222: 47..57 seconds
> without 2222: 40..45 seconds
>
> I do not know how much time does test initialization take (e.g. schema
> preparation), however, the difference seems to be visible.
>
> Vladimir
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Vladimir Sitnikov <si...@gmail.com>.
>From the flame graph, the most time is wasted in "getRelList", I didn't see

How do you collect and produce the flamegraphs?
Do you use debugnonsafepoints Java option?

Here are my measurements for running testQuery07:

with 2222: 47..57 seconds
without 2222: 40..45 seconds

I do not know how much time does test initialization take (e.g. schema
preparation), however, the difference seems to be visible.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Vladimir

From the flame graph, the most time is wasted in "getRelList", I didn't see
"getRelList().contains(...)" takes time, maybe it's not cost performance as
we thought, if you do worry about it, I can store RelSubset#rels in Set.

Vladimir > It would be great to resolve those cases, however, until the
cases are resolved (never?),",
From the flame graph, we can see actually ProjectMergeRule takes the most
time and the alternate rels do huge, my point is we can take a look to see
if it is reasonable(and the case in pull/2283), not mean to reslove all
those cases.

Regards!

Aron Tao


Vladimir Sitnikov <si...@gmail.com> 于2020年12月3日周四 下午5:48写道:

JiaTao Tao, thanks for the details!

JiaTao Tao> found. Without pull/2222, some cases will fail due to "There
are not enough
JiaTao Tao> rules to produce a node with desired properties: There is 1
empty subset",
JiaTao Tao> the root cause is the outdated rule matches have been fired.

The comment in 2222 reads "rel has been removed from set".
Does that mean VolcanoPlanner#rename results in an inconsistent state of
the rels/sets?
If that is the case, I would suggest moving the fix to
VolcanoPlanner#rename.

JiaTao Tao>2. pull/2222 itself is a positive fix, it due increase the time
but the
JiaTao Tao>true problem is RelSubset#getRelList, we can improve pull/2222 by
JiaTao Tao>improving RelSubset#getRelList

If you can make getRelList fast without losing too much time in other cases,
that would be awesome.

However, the current implementation of getRelList is slow, so it should be
used with caution.

Even if you make getRelList() a O(1) operation, it won't resolve
getRelList().contains(...)
That is why I'm inclined that 2222 should not come alone.

JiaTao Tao>3. Calcite sometime will generate huge alternate rels,

That is true. One of the known cases was
https://issues.apache.org/jira/browse/CALCITE-2223
It would be great to resolve those cases, however, until the cases are
resolved (never?),
we should not rely on "the list of rules being small" or something like
that.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Vladimir Sitnikov <si...@gmail.com>.
JiaTao Tao, thanks for the details!

JiaTao Tao> found. Without pull/2222, some cases will fail due to "There
are not enough
JiaTao Tao> rules to produce a node with desired properties: There is 1
empty subset",
JiaTao Tao> the root cause is the outdated rule matches have been fired.

The comment in 2222 reads "rel has been removed from set".
Does that mean VolcanoPlanner#rename results in an inconsistent state of
the rels/sets?
If that is the case, I would suggest moving the fix to
VolcanoPlanner#rename.

JiaTao Tao>2. pull/2222 itself is a positive fix, it due increase the time
but the
JiaTao Tao>true problem is RelSubset#getRelList, we can improve pull/2222 by
JiaTao Tao>improving RelSubset#getRelList

If you can make getRelList fast without losing too much time in other cases,
that would be awesome.

However, the current implementation of getRelList is slow, so it should be
used with caution.

Even if you make getRelList() a O(1) operation, it won't resolve
getRelList().contains(...)
That is why I'm inclined that 2222 should not come alone.

JiaTao Tao>3. Calcite sometime will generate huge alternate rels,

That is true. One of the known cases was
https://issues.apache.org/jira/browse/CALCITE-2223
It would be great to resolve those cases, however, until the cases are
resolved (never?),
we should not rely on "the list of rules being small" or something like
that.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Vladimir

I got your opinion, may we do a quick fix about the performance, then I'll
take a look again about this.

Regards!

Aron Tao


Vladimir Sitnikov <si...@gmail.com> 于2020年12月4日周五 上午5:13写道:

The improvement to RelTraitSet.satisfies would be nice, however, I still
believe getRelList must not be used in 2222.

It looks like the only possibility of the case when rel points to a relset
which does not include the rel is the case when the rel
has been replaced with an equivalent.

In other words, the planner finds that the rel is equivalent to another
rel, and it removes the new one from its set.
Should we remove the entry from mapRel2Subset as well?
Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of the
stale rel.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Julian, Vladimir

I remove the equiv rel( *mapRel2Subset.remove(rel)*) in
VolcanoPlanner#rename,
https://github.com/apache/calcite/pull/2290/files#diff-008c6d52bfd93bbe963a23c264bc412c68cac3b4837e3f10b8d5e4858cd4acb8
:

When I run the test again, it throws NPE in RuleQueue#skipMatch, in method
RuleQueue#checkDuplicateSubsets, we want a non-null subset: "*final
RelSubset subset = planner.getSubsetNonNull(rels[operand.ordinalInRule]);*"
Cuz we remove the rel from mapRel2Subset, so it failed here, here's my
questions.

1. How about I use getSubset in checkDuplicateSubsets, if it is null, just
breakdown, you can take a look at this PR:
https://github.com/apache/calcite/pull/2290/files.
2. Seems we will skip the match in multiple places(RuleQueue#skipMatch,
VolcanoRuleCall#onMatch), I found some work are duplicate like we all judge
the prunedNodes(0 importance), the logic seems not cohesive.

Regards!

Aron Tao


Julian Hyde <jh...@apache.org> 于2020年12月4日周五 上午11:49写道:

> I think Vladimir is very likely correct that the performance
> regression was due to algorithmic changes, that getRelList should not
> be used, and that improving its performance is not solving the real
> problem. So, I will not merge a fix for getRelList performance until
> after the algorithmic problem has been resolved.
>
> On Thu, Dec 3, 2020 at 6:54 PM JiaTao Tao <ta...@gmail.com> wrote:
> >
> > Hi Julian
> >
> > Thanks, Julian, your advice is really helpful! Do you mind I pull a
> request
> > to merge your code so that we can improve the performance very quickly?
> > About " stored the list of RelNodes ", do you concern about the duplicate
> > "relnode" store in "relsubset"? IMO, what we store is a reference, seems
> OK.
> >
> > Regards!
> >
> > Aron Tao
> >
> >
> > Vladimir Sitnikov <si...@gmail.com> 于2020年12月4日周五 上午5:13写道:
> >
> > The improvement to RelTraitSet.satisfies would be nice, however, I still
> > believe getRelList must not be used in 2222.
> >
> > It looks like the only possibility of the case when rel points to a
> relset
> > which does not include the rel is the case when the rel
> > has been replaced with an equivalent.
> >
> > In other words, the planner finds that the rel is equivalent to another
> > rel, and it removes the new one from its set.
> > Should we remove the entry from mapRel2Subset as well?
> > Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of
> the
> > stale rel.
> >
> > Vladimir
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Julian, Vladimir

Actually, I found TpchTest#testQuery07 can directly reproduce the
situation.  Put this in VolcanoRuleCall#match, delete the previous code,
and run the test, we can see the prints are full of the console(about 8k+).
I will take a look into this, to see if we can remove the outdated rel
in mapRel2Subset.

        if (subset != rel && !subset.getRelList().contains(rel)) {
          System.out.println("Rule [{}] not fired because operand #{}
({}) belongs to obsolete set");;
        }

Actually you can run TpchTest#testQuery03, the prints only occur 3 times,
more frindly for debug.


Regards!

Aron Tao


Julian Hyde <jh...@apache.org> 于2020年12月4日周五 上午11:49写道:

I think Vladimir is very likely correct that the performance
regression was due to algorithmic changes, that getRelList should not
be used, and that improving its performance is not solving the real
problem. So, I will not merge a fix for getRelList performance until
after the algorithmic problem has been resolved.

On Thu, Dec 3, 2020 at 6:54 PM JiaTao Tao <ta...@gmail.com> wrote:
>
> Hi Julian
>
> Thanks, Julian, your advice is really helpful! Do you mind I pull a
request
> to merge your code so that we can improve the performance very quickly?
> About " stored the list of RelNodes ", do you concern about the duplicate
> "relnode" store in "relsubset"? IMO, what we store is a reference, seems
OK.
>
> Regards!
>
> Aron Tao
>
>
> Vladimir Sitnikov <si...@gmail.com> 于2020年12月4日周五 上午5:13写道:
>
> The improvement to RelTraitSet.satisfies would be nice, however, I still
> believe getRelList must not be used in 2222.
>
> It looks like the only possibility of the case when rel points to a relset
> which does not include the rel is the case when the rel
> has been replaced with an equivalent.
>
> In other words, the planner finds that the rel is equivalent to another
> rel, and it removes the new one from its set.
> Should we remove the entry from mapRel2Subset as well?
> Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of
the
> stale rel.
>
> Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Julian Hyde <jh...@apache.org>.
I think Vladimir is very likely correct that the performance
regression was due to algorithmic changes, that getRelList should not
be used, and that improving its performance is not solving the real
problem. So, I will not merge a fix for getRelList performance until
after the algorithmic problem has been resolved.

On Thu, Dec 3, 2020 at 6:54 PM JiaTao Tao <ta...@gmail.com> wrote:
>
> Hi Julian
>
> Thanks, Julian, your advice is really helpful! Do you mind I pull a request
> to merge your code so that we can improve the performance very quickly?
> About " stored the list of RelNodes ", do you concern about the duplicate
> "relnode" store in "relsubset"? IMO, what we store is a reference, seems OK.
>
> Regards!
>
> Aron Tao
>
>
> Vladimir Sitnikov <si...@gmail.com> 于2020年12月4日周五 上午5:13写道:
>
> The improvement to RelTraitSet.satisfies would be nice, however, I still
> believe getRelList must not be used in 2222.
>
> It looks like the only possibility of the case when rel points to a relset
> which does not include the rel is the case when the rel
> has been replaced with an equivalent.
>
> In other words, the planner finds that the rel is equivalent to another
> rel, and it removes the new one from its set.
> Should we remove the entry from mapRel2Subset as well?
> Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of the
> stale rel.
>
> Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Julian

Thanks, Julian, your advice is really helpful! Do you mind I pull a request
to merge your code so that we can improve the performance very quickly?
About " stored the list of RelNodes ", do you concern about the duplicate
"relnode" store in "relsubset"? IMO, what we store is a reference, seems OK.

Regards!

Aron Tao


Vladimir Sitnikov <si...@gmail.com> 于2020年12月4日周五 上午5:13写道:

The improvement to RelTraitSet.satisfies would be nice, however, I still
believe getRelList must not be used in 2222.

It looks like the only possibility of the case when rel points to a relset
which does not include the rel is the case when the rel
has been replaced with an equivalent.

In other words, the planner finds that the rel is equivalent to another
rel, and it removes the new one from its set.
Should we remove the entry from mapRel2Subset as well?
Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of the
stale rel.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Vladimir Sitnikov <si...@gmail.com>.
The improvement to RelTraitSet.satisfies would be nice, however, I still
believe getRelList must not be used in 2222.

It looks like the only possibility of the case when rel points to a relset
which does not include the rel is the case when the rel
has been replaced with an equivalent.

In other words, the planner finds that the rel is equivalent to another
rel, and it removes the new one from its set.
Should we remove the entry from mapRel2Subset as well?
Then RuleCall could treat "missing entry in mapRel2Subset" as a sign of the
stale rel.

Vladimir

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Julian Hyde <jh...@apache.org>.
A long time ago, we stored the list of RelNodes in each RelSubset, and
changed to a shared list in RelSet. I still think it makes sense,
especially when subsets are not disjoint (e.g. if a set has one subset
sorted by (x) and another subset sorted by (x, y) then all of the
RelNodes in the latter will also be in the former subset).

A simple optimization to getRelList makes testQuery07 about 20% faster
(39s to 33s for me):

diff --git a/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
b/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
index b1e0b6322..7f7e8cbc9 100644
--- a/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
+++ b/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
@@ -399,7 +399,7 @@ RelNode buildCheapestPlan(VolcanoPlanner planner) {
   public List<RelNode> getRelList() {
     final List<RelNode> list = new ArrayList<>();
     for (RelNode rel : set.rels) {
-      if (rel.getTraitSet().satisfies(traitSet)) {
+      if (rel.getTraitSet() == traitSet ||
rel.getTraitSet().satisfies(traitSet)) {
         list.add(rel);
       }
     }

It's worth pushing down that logic into RelTraitSet.satisfies, so
other code will benefit. And maybe some further improvements to
RelTraitSet.satisfies and RelTrait.satisfies.

Julian

On Thu, Dec 3, 2020 at 6:03 AM JiaTao Tao <ta...@gmail.com> wrote:
>
> Hi Chunwei
> Sure, I'll organize a document to list the root clauses about the
> optimization StackOverflow and fire JIRAS, thank for your help.
>
> Regards!
>
> Aron Tao
>
>
> Chunwei Lei <ch...@gmail.com> 于2020年12月3日周四 下午4:01写道:
>
> > Thank you for your effort, Aron.
> >
> > I am wondering if you can file a jira issue and paste your findings.
> > I would like to look into it if I find some time.
> >
> > Best,
> > Chunwei
> >
> >
> > On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <ta...@gmail.com> wrote:
> >
> > > Hi Vladimir
> > > Thanks for your reply:
> > > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> > > re-run the PR ut, the ut is passed, so this ut timeout may be an
> > occasional
> > > situation.
> > >
> > > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> > > about 23%, sorry for that, we test the feature in our product env(Calcite
> > > is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> > > day), and we have monitor about optimization time, no obvious changes
> > were
> > > found. Without pull/2222, some cases will fail due to "There are not
> > enough
> > > rules to produce a node with desired properties: There is 1 empty
> > subset",
> > > the root cause is the outdated rule matches have been fired.
> > >
> > > Actually, I found actually the time was taken by RelSubset#getRelList, a
> > > fresh case I found yesterday, see
> > > https://github.com/apache/calcite/pull/2283, it OOM
> > > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root
> > clause
> > > must be the deadlock between rules), so I think the pull/2222's fix is
> > not
> > > a problem, the true problem is in "getRelList", my proposal is we can
> > store
> > > the relList in RelSubSet like RelSet#rels.
> > >
> > > 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> > > ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> > > so it's pretty large. In fact, after profile, I found most of the time is
> > > taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> > > Overflow) are also related to this rule, maybe we can see if we can
> > > optimize this.
> > >
> > > Finally, to summarize:
> > > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> > > ends about 40~45s(in my MBP15 2019).
> > > 2. pull/2222 itself is a positive fix, it due increase the time but the
> > > true problem is RelSubset#getRelList, we can improve pull/2222 by
> > > improving RelSubset#getRelList, this not only benefits this, it
> > > makes getRelList a cheap function as its name, we can call this with
> > > no burden
> > > 3. Calcite sometime will generate huge alternate rels, it actually not
> > > reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> > > but the SQL is highly private, If you are interested, I can share the
> > > cases' root clause using blog or email, but I may have no time to
> > construct
> > > these cases. Let's workout together to decrease them.
> > >
> > > Regards!
> > >
> > > Aron Tao
> > >
> > >
> > > Vladimir Sitnikov <si...@gmail.com> 于2020年12月2日周三 下午10:01写道:
> > >
> > > > Hi,
> > > >
> > > > I've made a quick check, and the results are as follows:
> > > > 1) The test was enabled recently (2020-10-20) in
> > > >
> > > >
> > >
> > https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > > > 2) One of the recent changes that do significantly impact the
> > performance
> > > > of the test is https://github.com/apache/calcite/pull/2222
> > > > 3) It is sad we commit performance-related fixes without taking
> > > performance
> > > > into consideration :-(
> > > >
> > > > Do you think you could look into 2222 again and check if the slowdown
> > was
> > > > expected?
> > > > Do you think you could improve 2222?
> > > > Do you think we should revert 2222?
> > > >
> > > > Vladimir
> > > >
> > >
> >

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Chunwei
Sure, I'll organize a document to list the root clauses about the
optimization StackOverflow and fire JIRAS, thank for your help.

Regards!

Aron Tao


Chunwei Lei <ch...@gmail.com> 于2020年12月3日周四 下午4:01写道:

> Thank you for your effort, Aron.
>
> I am wondering if you can file a jira issue and paste your findings.
> I would like to look into it if I find some time.
>
> Best,
> Chunwei
>
>
> On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <ta...@gmail.com> wrote:
>
> > Hi Vladimir
> > Thanks for your reply:
> > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> > re-run the PR ut, the ut is passed, so this ut timeout may be an
> occasional
> > situation.
> >
> > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> > about 23%, sorry for that, we test the feature in our product env(Calcite
> > is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> > day), and we have monitor about optimization time, no obvious changes
> were
> > found. Without pull/2222, some cases will fail due to "There are not
> enough
> > rules to produce a node with desired properties: There is 1 empty
> subset",
> > the root cause is the outdated rule matches have been fired.
> >
> > Actually, I found actually the time was taken by RelSubset#getRelList, a
> > fresh case I found yesterday, see
> > https://github.com/apache/calcite/pull/2283, it OOM
> > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root
> clause
> > must be the deadlock between rules), so I think the pull/2222's fix is
> not
> > a problem, the true problem is in "getRelList", my proposal is we can
> store
> > the relList in RelSubSet like RelSet#rels.
> >
> > 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> > ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> > so it's pretty large. In fact, after profile, I found most of the time is
> > taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> > Overflow) are also related to this rule, maybe we can see if we can
> > optimize this.
> >
> > Finally, to summarize:
> > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> > ends about 40~45s(in my MBP15 2019).
> > 2. pull/2222 itself is a positive fix, it due increase the time but the
> > true problem is RelSubset#getRelList, we can improve pull/2222 by
> > improving RelSubset#getRelList, this not only benefits this, it
> > makes getRelList a cheap function as its name, we can call this with
> > no burden
> > 3. Calcite sometime will generate huge alternate rels, it actually not
> > reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> > but the SQL is highly private, If you are interested, I can share the
> > cases' root clause using blog or email, but I may have no time to
> construct
> > these cases. Let's workout together to decrease them.
> >
> > Regards!
> >
> > Aron Tao
> >
> >
> > Vladimir Sitnikov <si...@gmail.com> 于2020年12月2日周三 下午10:01写道:
> >
> > > Hi,
> > >
> > > I've made a quick check, and the results are as follows:
> > > 1) The test was enabled recently (2020-10-20) in
> > >
> > >
> >
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > > 2) One of the recent changes that do significantly impact the
> performance
> > > of the test is https://github.com/apache/calcite/pull/2222
> > > 3) It is sad we commit performance-related fixes without taking
> > performance
> > > into consideration :-(
> > >
> > > Do you think you could look into 2222 again and check if the slowdown
> was
> > > expected?
> > > Do you think you could improve 2222?
> > > Do you think we should revert 2222?
> > >
> > > Vladimir
> > >
> >
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Chunwei Lei <ch...@gmail.com>.
Thank you for your effort, Aron.

I am wondering if you can file a jira issue and paste your findings.
I would like to look into it if I find some time.

Best,
Chunwei


On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <ta...@gmail.com> wrote:

> Hi Vladimir
> Thanks for your reply:
> 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> re-run the PR ut, the ut is passed, so this ut timeout may be an occasional
> situation.
>
> 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> about 23%, sorry for that, we test the feature in our product env(Calcite
> is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> day), and we have monitor about optimization time, no obvious changes were
> found. Without pull/2222, some cases will fail due to "There are not enough
> rules to produce a node with desired properties: There is 1 empty subset",
> the root cause is the outdated rule matches have been fired.
>
> Actually, I found actually the time was taken by RelSubset#getRelList, a
> fresh case I found yesterday, see
> https://github.com/apache/calcite/pull/2283, it OOM
> in VolcanoRuleCall.matchRecurse when calling getRelList(but the root clause
> must be the deadlock between rules), so I think the pull/2222's fix is not
> a problem, the true problem is in "getRelList", my proposal is we can store
> the relList in RelSubSet like RelSet#rels.
>
> 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> so it's pretty large. In fact, after profile, I found most of the time is
> taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> Overflow) are also related to this rule, maybe we can see if we can
> optimize this.
>
> Finally, to summarize:
> 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> ends about 40~45s(in my MBP15 2019).
> 2. pull/2222 itself is a positive fix, it due increase the time but the
> true problem is RelSubset#getRelList, we can improve pull/2222 by
> improving RelSubset#getRelList, this not only benefits this, it
> makes getRelList a cheap function as its name, we can call this with
> no burden
> 3. Calcite sometime will generate huge alternate rels, it actually not
> reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> but the SQL is highly private, If you are interested, I can share the
> cases' root clause using blog or email, but I may have no time to construct
> these cases. Let's workout together to decrease them.
>
> Regards!
>
> Aron Tao
>
>
> Vladimir Sitnikov <si...@gmail.com> 于2020年12月2日周三 下午10:01写道:
>
> > Hi,
> >
> > I've made a quick check, and the results are as follows:
> > 1) The test was enabled recently (2020-10-20) in
> >
> >
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > 2) One of the recent changes that do significantly impact the performance
> > of the test is https://github.com/apache/calcite/pull/2222
> > 3) It is sad we commit performance-related fixes without taking
> performance
> > into consideration :-(
> >
> > Do you think you could look into 2222 again and check if the slowdown was
> > expected?
> > Do you think you could improve 2222?
> > Do you think we should revert 2222?
> >
> > Vladimir
> >
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by JiaTao Tao <ta...@gmail.com>.
Hi Vladimir
Thanks for your reply:
1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
re-run the PR ut, the ut is passed, so this ut timeout may be an occasional
situation.

2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
about 23%, sorry for that, we test the feature in our product env(Calcite
is our SQL facade in ByteDance, it process hundreds of thousands SQL per
day), and we have monitor about optimization time, no obvious changes were
found. Without pull/2222, some cases will fail due to "There are not enough
rules to produce a node with desired properties: There is 1 empty subset",
the root cause is the outdated rule matches have been fired.

Actually, I found actually the time was taken by RelSubset#getRelList, a
fresh case I found yesterday, see
https://github.com/apache/calcite/pull/2283, it OOM
in VolcanoRuleCall.matchRecurse when calling getRelList(but the root clause
must be the deadlock between rules), so I think the pull/2222's fix is not
a problem, the true problem is in "getRelList", my proposal is we can store
the relList in RelSubSet like RelSet#rels.

3. For the case TpchTest#testQuery07, after optimization, the largest rel
ID is increased to *900k*, and has *20k* rels remain in volcano planner,
so it's pretty large. In fact, after profile, I found most of the time is
taken by ProjectMergeRule, in our company's error cases(OOM, Stack
Overflow) are also related to this rule, maybe we can see if we can
optimize this.

Finally, to summarize:
1. TpchTest#testQuery07 timeout is an occasional situation, normally it
ends about 40~45s(in my MBP15 2019).
2. pull/2222 itself is a positive fix, it due increase the time but the
true problem is RelSubset#getRelList, we can improve pull/2222 by
improving RelSubset#getRelList, this not only benefits this, it
makes getRelList a cheap function as its name, we can call this with
no burden
3. Calcite sometime will generate huge alternate rels, it actually not
reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
but the SQL is highly private, If you are interested, I can share the
cases' root clause using blog or email, but I may have no time to construct
these cases. Let's workout together to decrease them.

Regards!

Aron Tao


Vladimir Sitnikov <si...@gmail.com> 于2020年12月2日周三 下午10:01写道:

> Hi,
>
> I've made a quick check, and the results are as follows:
> 1) The test was enabled recently (2020-10-20) in
>
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> 2) One of the recent changes that do significantly impact the performance
> of the test is https://github.com/apache/calcite/pull/2222
> 3) It is sad we commit performance-related fixes without taking performance
> into consideration :-(
>
> Do you think you could look into 2222 again and check if the slowdown was
> expected?
> Do you think you could improve 2222?
> Do you think we should revert 2222?
>
> Vladimir
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Posted by Vladimir Sitnikov <si...@gmail.com>.
Hi,

I've made a quick check, and the results are as follows:
1) The test was enabled recently (2020-10-20) in
https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
2) One of the recent changes that do significantly impact the performance
of the test is https://github.com/apache/calcite/pull/2222
3) It is sad we commit performance-related fixes without taking performance
into consideration :-(

Do you think you could look into 2222 again and check if the slowdown was
expected?
Do you think you could improve 2222?
Do you think we should revert 2222?

Vladimir