You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Julian Hyde <jh...@apache.org> on 2018/09/09 20:22:53 UTC

Commit 88f125541, "Reduce HepPlannerTest#testRuleApplyCount complexity"

Vladimir,

I strongly disagree with your commit 88f125541. You have removed test
code that is useful in creating a high-quality product. I presume that
your goal is to make the test suite run a little faster.

I think you should back out your commit.

I sent a previous email on the subject but you did not respond.

Julian

Re: Commit 88f125541, "Reduce HepPlannerTest#testRuleApplyCount complexity"

Posted by Julian Hyde <jh...@apache.org>.
The technical justification is that the code — yes, sql is code — that you removed might have found a bug someday. A deep relational algebra tree places stresses on hep planner (and other parts of the system) that we cannot predict. The (implicit) expected behavior is that it doesn’t crash, produces the same result as yesterday, doesn’t use excessive memory, and finishes in a reasonable time. We test all of those things every time we run the test.

You are absolutely right that Calcite needs more testing, in many different categories. But we have no resources to write and maintain tests, so we have to do our best.

For example, I would LOVE to do performance regression tests. I know that the right way to do performance regression tests is to run the same test every day in the same controlled environment, measure its performance, and detect deviations over time. But we don’t have the resources to do that. All way have is a few tests that work on inputs that are big and complex that if something breaks the delta will be so large that someone will be bound to notice it. One of these tests you just gutted.

I didn’t read the rest of your email.

Julian


> On Sep 9, 2018, at 2:57 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Julian>You seem to believe that the only purpose of a test is to test the
> specific bug or change for which it was introduced
> 
> I'm afraid you seem to put words in my mouth.
> 
> Note: so far I have technical reasons to keep the test with reduced number
> of lines while you don't.
> 
> Note2: original test included SQL with 100 or so lines. Somehow that
> pleases you. I've reduced the test to 10 or so, and it pisses you off.
> Could you please provide a technical clarification/justification on what
> number of SQL lines is enough in that test?
> I'm sure you can't, so could we just agree that you over-reacted on "commit
> 88f125541"?
> 
> You might ignore the below part, however you might learn a bit on
> property-based testing and/or fuzzy-testing and/or performance testing.
> 
> 
> Of course a test method can test zillions of scenarios "by accident".
> 
> However, each test should have some EXPECTED behaviour.
> For instance:
> a) "the number of HEP applied rules equals to 42 union all union all query"
> b) "the number of HEP applied rules is less when DEPTH_FIRST order is used
> for union all union all query"
> b*) "the number of HEP applied rules is less when DEPTH_FIRST order is used
> for ANY query"
> c) "the number of HEP applied rules grows linearly as the number of unions
> grows"
> d) "planning time grows linearly as the number of unions grows"
> e) "planning time fits in X seconds for query of size Y at CPU of Z GHz"
> f) "Calcite does not format hard drive when execute union all union all
> query"
> g) "Calcite is able to sustain load for at least 1 week and avoid out of
> memory, performance degradations, etc, etc"
> h) "your pick"
> 
> I find risks a..g quite important for Calcite.
> 
> I'm sure testRuleApplyCount can cover cases a and b only, and for those
> cases it does not require to process 1000 unions.
> 
> a is straight-forward. It does not require to have 1000'000 iterations in
> order to ensure "a".
> b is straight-forward as well as long as we are dealing with single SQL.
> 
> b*, c is not that trivial to prove since queries might have different size
> and shape.
> Property-based testing suits here: one can generate a query, apply two
> different HEP programs and compare if DEPTH_FIRST produces result faster.
> Current test verifies just a single SQL shape, however property-based test
> could pick random shape and size, and it could validate new SQL every time.
> Note: that randomization could still fit in 5 seconds, however those 5
> seconds would be multiplied by the number of test-suite executions, so it
> would result in quite good coverage.
> Old code of testRuleApplyCount massaged just a single SQL again and again,
> so it very unlikely to find a true bug.
> 
> CALCITE-2506
> <https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2506> is
> and example of bug identified by fuzzing + property-based testing.
> 
> d and e
> 
> performance tests (e.g. measurement of planning duration) require more than
> a single point.
> You can't measure performance by executing a single test on a random
> hardware.
> One would likely require isolated hardware as there's no way to measure
> responses time on a shared environment.
> If planning sequence like in testRuleApplyCount is truly important, it is a
> good candidate for inclusion to ubenchmark suite. There we could
> use @Param({1, 10, 100, 1000}) numberOfUnions and see how response time
> grows as number of unions is changed.
> 
> 
> f) Calcite does not format hard drive...
> The tests for unexpected behaviour are better be done via fuzzing.
> That is one could craft a randomized SQL, pass it to Calcite (with
> randomized set of activated rules) and see what happens.
> OutOfMemory and/or StackOverflow would be a strong sign of a bug in Calcite.
> 
> Here's an example of such issues: CALCITE-2527
> <https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2527>,
> CALCITE-2526
> <https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2526>
> 
> g) "Calcite is able to sustain load for at least 1 week and avoid out of
> memory, performance degradations, etc, etc"
> 
> This is called endurance testing, and the way to test there is to run a
> test for a week.
> You can't emulate that in a unit-test that has to complete in a reasonable
> time. "1 week" is not reasonable duration for unit test execution.
> 
> Vladimir


Re: Commit 88f125541, "Reduce HepPlannerTest#testRuleApplyCount complexity"

Posted by Vladimir Sitnikov <si...@gmail.com>.
Julian>You seem to believe that the only purpose of a test is to test the
specific bug or change for which it was introduced

I'm afraid you seem to put words in my mouth.

Note: so far I have technical reasons to keep the test with reduced number
of lines while you don't.

Note2: original test included SQL with 100 or so lines. Somehow that
pleases you. I've reduced the test to 10 or so, and it pisses you off.
Could you please provide a technical clarification/justification on what
number of SQL lines is enough in that test?
I'm sure you can't, so could we just agree that you over-reacted on "commit
88f125541"?

You might ignore the below part, however you might learn a bit on
property-based testing and/or fuzzy-testing and/or performance testing.


Of course a test method can test zillions of scenarios "by accident".

However, each test should have some EXPECTED behaviour.
For instance:
a) "the number of HEP applied rules equals to 42 union all union all query"
b) "the number of HEP applied rules is less when DEPTH_FIRST order is used
for union all union all query"
b*) "the number of HEP applied rules is less when DEPTH_FIRST order is used
for ANY query"
c) "the number of HEP applied rules grows linearly as the number of unions
grows"
d) "planning time grows linearly as the number of unions grows"
e) "planning time fits in X seconds for query of size Y at CPU of Z GHz"
f) "Calcite does not format hard drive when execute union all union all
query"
g) "Calcite is able to sustain load for at least 1 week and avoid out of
memory, performance degradations, etc, etc"
h) "your pick"

I find risks a..g quite important for Calcite.

I'm sure testRuleApplyCount can cover cases a and b only, and for those
cases it does not require to process 1000 unions.

a is straight-forward. It does not require to have 1000'000 iterations in
order to ensure "a".
b is straight-forward as well as long as we are dealing with single SQL.

b*, c is not that trivial to prove since queries might have different size
and shape.
Property-based testing suits here: one can generate a query, apply two
different HEP programs and compare if DEPTH_FIRST produces result faster.
Current test verifies just a single SQL shape, however property-based test
could pick random shape and size, and it could validate new SQL every time.
Note: that randomization could still fit in 5 seconds, however those 5
seconds would be multiplied by the number of test-suite executions, so it
would result in quite good coverage.
Old code of testRuleApplyCount massaged just a single SQL again and again,
so it very unlikely to find a true bug.

CALCITE-2506
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2506> is
and example of bug identified by fuzzing + property-based testing.

d and e

performance tests (e.g. measurement of planning duration) require more than
a single point.
You can't measure performance by executing a single test on a random
hardware.
One would likely require isolated hardware as there's no way to measure
responses time on a shared environment.
If planning sequence like in testRuleApplyCount is truly important, it is a
good candidate for inclusion to ubenchmark suite. There we could
use @Param({1, 10, 100, 1000}) numberOfUnions and see how response time
grows as number of unions is changed.


f) Calcite does not format hard drive...
The tests for unexpected behaviour are better be done via fuzzing.
That is one could craft a randomized SQL, pass it to Calcite (with
randomized set of activated rules) and see what happens.
OutOfMemory and/or StackOverflow would be a strong sign of a bug in Calcite.

Here's an example of such issues: CALCITE-2527
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2527>,
CALCITE-2526
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2526>

g) "Calcite is able to sustain load for at least 1 week and avoid out of
memory, performance degradations, etc, etc"

This is called endurance testing, and the way to test there is to run a
test for a week.
You can't emulate that in a unit-test that has to complete in a reasonable
time. "1 week" is not reasonable duration for unit test execution.

Vladimir

Re: Commit 88f125541, "Reduce HepPlannerTest#testRuleApplyCount complexity"

Posted by Julian Hyde <jh...@apache.org>.
You seem to believe that the only purpose of a test is to test the specific bug or change for which it was introduced. That is indeed the philosophy of test-driven development, but there are more things in heaven and earth than TDD.

Calcite is a complex project, and needs complex tests to shake out complex issues. We do not have a QA team to write tests. Therefore the tests will be written during the course of fixing bugs.

Julian


> On Sep 9, 2018, at 1:48 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Julian>I strongly disagree with your commit 88f125541. You have removed test
> code that is useful in creating a high-quality product.
> 
> Could you please provide technical justification?
> 
> Note: the test is still there.
> Note2: the test still runs in both Travis and Apache Jenkins. The test
> still runs from mvn test by default.
> 
> I'm inclined to suppose your "removed test code" wording is not quite right.
> 
> Note2: it still verifies that ARBITRARY order results in more rule
> applications than DEPTH_FIRST.
> Note3: original test code did not verify that the result of Hep planning
> matched to the expected one.
> 
> Juilan> I think you should back out your commit.
> 
> I think I should not.
> 
> Julina>I sent a previous email on the subject but you did not respond.
> 
> I'm afraid I fail to see which response do you expect.
> My opinion on the testRuleApplyCount test is as follows:
> a) It is useful to compare ARBITRARY and DEPTH_FIRST implementations.
> DEPTH_FIRST wins there, and it does NOT require to do a 1000'000 iterations
> to prove that.
> That is why I reduced test complexity so a unit does not exceed 5 seconds.
> 
> b) The SQL+Hep program result in bad scalability for Calcite. In other
> words, as the number of unions grows, planning becomes very slow. While it
> is interesting, and it might be even a bug in Calcite, I do not see how
> executing such a test (on every Travis build, on every Jenkins build, and
> on every dev build) helps Calcite development.
> 
> I can add a benchmark that would run that test with various numbers of
> unions and print the response time.
> However, we should use Apache and Travis' resources wisely and we should
> refrain from burning CPU for little reason.
> 
> I see your passion to build stable and robust software (that is great!),
> however I'm quite sure HepPlannerTest#testRuleApplyCount is not something
> that provides extra safety.
> 
> Vladimir


Re: Commit 88f125541, "Reduce HepPlannerTest#testRuleApplyCount complexity"

Posted by Vladimir Sitnikov <si...@gmail.com>.
Julian>I strongly disagree with your commit 88f125541. You have removed test
code that is useful in creating a high-quality product.

Could you please provide technical justification?

Note: the test is still there.
Note2: the test still runs in both Travis and Apache Jenkins. The test
still runs from mvn test by default.

I'm inclined to suppose your "removed test code" wording is not quite right.

Note2: it still verifies that ARBITRARY order results in more rule
applications than DEPTH_FIRST.
Note3: original test code did not verify that the result of Hep planning
matched to the expected one.

Juilan> I think you should back out your commit.

I think I should not.

Julina>I sent a previous email on the subject but you did not respond.

I'm afraid I fail to see which response do you expect.
My opinion on the testRuleApplyCount test is as follows:
a) It is useful to compare ARBITRARY and DEPTH_FIRST implementations.
DEPTH_FIRST wins there, and it does NOT require to do a 1000'000 iterations
to prove that.
That is why I reduced test complexity so a unit does not exceed 5 seconds.

b) The SQL+Hep program result in bad scalability for Calcite. In other
words, as the number of unions grows, planning becomes very slow. While it
is interesting, and it might be even a bug in Calcite, I do not see how
executing such a test (on every Travis build, on every Jenkins build, and
on every dev build) helps Calcite development.

I can add a benchmark that would run that test with various numbers of
unions and print the response time.
However, we should use Apache and Travis' resources wisely and we should
refrain from burning CPU for little reason.

I see your passion to build stable and robust software (that is great!),
however I'm quite sure HepPlannerTest#testRuleApplyCount is not something
that provides extra safety.

Vladimir