You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/01/27 11:54:20 UTC
[GitHub] [tvm-rfcs] Mousius commented on a change in pull request #55: @slow test RFC

Mousius commented on a change in pull request #55:
URL: https://github.com/apache/tvm-rfcs/pull/55#discussion_r793513849



##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.

Review comment:
       ```suggestion
   A small subset of tests take up a large portion of the total test runtime in Pull Requests (PRs). This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on `main`.
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):

Review comment:
       Can we include a snippet of this table, maybe the top 5 rows just in case you decide to clean up your gists one day? :smile_cat: 

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):

Review comment:
       ```suggestion
   The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order, showing the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.
+
+# Rationale and alternatives
+
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+This isn't a complete solution. Most PRs end up running lots of tests that the
+PR didn't affect at all. Ideally we would have a way a-la Bazel or Buck to
+walk the actual dependency graph of PRs (ignoring / banning some of Python's
+dynamism e.g. `importlib.import_module`) to determine what tests to run on a PR.
+This could also be implemented at the human level, with developers tagging their
+PRs based on what they think should run, though this has a higher potential to
+miss certain tests. However, this run-what-changed future would be difficult to
+achieve.
+
+Other efforts involve looking into tests themselves to determine why they are slow.
+Often TVM's tests are running much more work than they actually intend to test in
+more of an integration test than a unit test. Replacing these types of test with
+a framework that makes it easier to test TVM passes and functionality in smaller
+chunks is related but orthogonal to this work. It still has the same issue (coverage

Review comment:
       How is coverage still reduced if we put in place the better testing practices?

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.

Review comment:
       Maybe mention that this can be extended to C++ tests using the same trigger variable?

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.
+
+# Rationale and alternatives
+
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+This isn't a complete solution. Most PRs end up running lots of tests that the
+PR didn't affect at all. Ideally we would have a way a-la Bazel or Buck to
+walk the actual dependency graph of PRs (ignoring / banning some of Python's
+dynamism e.g. `importlib.import_module`) to determine what tests to run on a PR.
+This could also be implemented at the human level, with developers tagging their
+PRs based on what they think should run, though this has a higher potential to
+miss certain tests. However, this run-what-changed future would be difficult to
+achieve.
+
+Other efforts involve looking into tests themselves to determine why they are slow.
+Often TVM's tests are running much more work than they actually intend to test in

Review comment:
       ```suggestion
   Often TVM's tests are running much more work than they actually intend to test (such as using entire off-the-shelf networks to test a few operators) in
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.
+
+# Rationale and alternatives
+
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+This isn't a complete solution. Most PRs end up running lots of tests that the
+PR didn't affect at all. Ideally we would have a way a-la Bazel or Buck to

Review comment:
       The Python parallel would be `testmon` as far as I know, may be a better reference? Bazel has a whole host of other idiosyncrasy to deal with. 

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.

Review comment:
       ```suggestion
   CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within ([apache/tvm#9834](https://github.com/apache/tvm/pull/9834)) or between ([apache/tvm#9733](https://github.com/apache/tvm/pull/9733)) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements

Review comment:
       ```suggestion
   A decorator `@tvm.testing.slow` will be added (see [apache/tvm#10057](https://github.com/apache/tvm/pull/10057)) that implements
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.

Review comment:
       ```suggestion
   The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like [apache/tvm#9554](https://github.com/apache/tvm/pull/9554) will make the revert process much smoother as well to minimize disruptions.
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.
+
+# Rationale and alternatives
+
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+This isn't a complete solution. Most PRs end up running lots of tests that the
+PR didn't affect at all. Ideally we would have a way a-la Bazel or Buck to
+walk the actual dependency graph of PRs (ignoring / banning some of Python's
+dynamism e.g. `importlib.import_module`) to determine what tests to run on a PR.
+This could also be implemented at the human level, with developers tagging their
+PRs based on what they think should run, though this has a higher potential to
+miss certain tests. However, this run-what-changed future would be difficult to
+achieve.
+
+Other efforts involve looking into tests themselves to determine why they are slow.
+Often TVM's tests are running much more work than they actually intend to test in
+more of an integration test than a unit test. Replacing these types of test with
+a framework that makes it easier to test TVM passes and functionality in smaller
+chunks is related but orthogonal to this work. It still has the same issue (coverage
+is reduced) but with the drawback of requiring significantly higher resources (though
+it is on our roadmap in the near future). Over time as slow tests are manually
+debugged, `@slow` decorators could be removed.
+
+# Prior art
+
+[prior-art]: #prior-art
+
+- Tensorflow skips long tests: https://www.tensorflow.org/community/contribute/tests#test_times_should_aim_for_half_of_test_size_timeout_to_avoid_flakes
+- PyTorch does essentially the same thing with a `@slow` decorator but maintains
+  an automatically updating list of slow tests to skip: https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_utils.py#L748-L755
+- scikit-learn has a similar system for low-signal, slow tests: https://github.com/scikit-learn/scikit-learn/pull/21645
+
+# Unresolved questions
+
+[unresolved-questions]: #unresolved-questions
+
+- What tests do we `@slow`? Based on discussion it seems like going down the list of slow tests one by one (at least to start), it is prudent to do a preliminary investigation to answer:
+
+  1. What is this test? Should it be slow?
+  2. How often does this test fail? If the test fails often, there is less of a case that it should be `@slow`-ed since it provides good signal to developers.
+  3. Who relies on this test? Do they understand the implications of `@slow`?
+
+- Who will monitor `main` for PR-related breakages? What is the SLA on fixes? Recent additions such as [messaging Discord on `main` failures](https://github.com/tlc-pack/ci-monitoring) and keeping track of the last known good commit (apache/tvm#10056) should make this easier.

Review comment:
       ```suggestion
   - Who will monitor `main` for PR-related breakages? What is the SLA on fixes? Recent additions such as [messaging Discord on `main` failures](https://github.com/tlc-pack/ci-monitoring) and keeping track of the last known good commit ([apache/tvm#10056](https://github.com/apache/tvm/pull/10056)) should make this easier.
   ```

##########
File path: rfcs/0055-slow-tests.md
##########
@@ -0,0 +1,110 @@
+- Feature Name: `slow_tests_decorator`
+- Start Date: 2022-01-26
+- RFC PR: [apache/tvm-rfcs#55](https://github.com/apache/tvm-rfcs/pull/55)
+
+# Summary
+
+[summary]: #summary
+
+Add a Python decorator to skip tests on PRs but run them on branches (e.g. `main`).
+
+# Motivation
+
+[motivation]: #motivation
+
+A small subset of tests take up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs where CI runtime is critical and run them only on main.
+
+# Guide-level explanation
+
+[guide-level-explanation]: #guide-level-explanation
+
+CI runtime constantly plagues TVM developers with long iteration times, exacerbated by flakiness and difficult-to-reproduce steps. To reduce runtime, we can execute more work concurrently and increase usage of available resources, usually by parallelization within (apache/tvm#9834) or between (apache/tvm#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests on `main` only, we still get some measure of coverage provided by these tests without burdening PR developers.
+
+The runtime savings of this change are potentially significant, as this gives us a black-box knob which we can manually tune over time to trade off between PR test coverage and PR test runtime. [This gist](https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1) shows a listing of TVM test runtime in descending order. And this list the potential time savings (across all jobs) in CI of cutting off tests above an arbitrary runtime (not to propose use a cutoff, but just to demonstrate in broad strokes the potential impact of this change):
+
+```
+[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
+[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
+[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
+[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
+[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
+[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
+[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
+[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
+[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
+[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
+```
+
+# Reference-level explanation
+
+[reference-level-explanation]: #reference-level-explanation
+
+A decorator `@tvm.testing.slow` will be added (see apache/tvm#10057) that implements
+the above behavior. Skipping slow tests would be an opt-in, rather than opt-out.
+This way developers who don't read this RFC won't have to adjust their workflows
+at all to run these tests locally. There is also a need to run slow tests on PRs
+in some cases, such as fixing reverted commits or if a developer suspects their
+change would have a wide reaching impact. In this case, `@ci run slow tests` can
+be added to the PR body before tests are run in order to disable skipping slow tests.
+
+Using a decorator has the advantage of being explicit compared to an automated system to detect and skip slow tests. Clicking through to the decorator’s short definition makes it clear what is going on so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption but can be controlled by setting `SKIP_SLOW_TESTS=1`.
+
+An environment variable `SKIP_SLOW_TESTS=1` will be set in Jenkins on PRs. Branches,
+including `main` and `ci-docker-staging` will not have this flag set and will
+always run the full set of tests.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+The primary caveat is that tests that run on `main` may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant (see below) enough to make this worth it. Developments like apache/tvm#9554 will make the revert process much smoother as well to minimize disruptions.
+
+# Rationale and alternatives
+
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+This isn't a complete solution. Most PRs end up running lots of tests that the
+PR didn't affect at all. Ideally we would have a way a-la Bazel or Buck to
+walk the actual dependency graph of PRs (ignoring / banning some of Python's
+dynamism e.g. `importlib.import_module`) to determine what tests to run on a PR.
+This could also be implemented at the human level, with developers tagging their
+PRs based on what they think should run, though this has a higher potential to
+miss certain tests. However, this run-what-changed future would be difficult to
+achieve.
+
+Other efforts involve looking into tests themselves to determine why they are slow.
+Often TVM's tests are running much more work than they actually intend to test in
+more of an integration test than a unit test. Replacing these types of test with
+a framework that makes it easier to test TVM passes and functionality in smaller
+chunks is related but orthogonal to this work. It still has the same issue (coverage
+is reduced) but with the drawback of requiring significantly higher resources (though
+it is on our roadmap in the near future). Over time as slow tests are manually
+debugged, `@slow` decorators could be removed.
+
+# Prior art
+
+[prior-art]: #prior-art
+
+- Tensorflow skips long tests: https://www.tensorflow.org/community/contribute/tests#test_times_should_aim_for_half_of_test_size_timeout_to_avoid_flakes
+- PyTorch does essentially the same thing with a `@slow` decorator but maintains
+  an automatically updating list of slow tests to skip: https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_utils.py#L748-L755
+- scikit-learn has a similar system for low-signal, slow tests: https://github.com/scikit-learn/scikit-learn/pull/21645
+
+# Unresolved questions
+
+[unresolved-questions]: #unresolved-questions
+
+- What tests do we `@slow`? Based on discussion it seems like going down the list of slow tests one by one (at least to start), it is prudent to do a preliminary investigation to answer:
+
+  1. What is this test? Should it be slow?
+  2. How often does this test fail? If the test fails often, there is less of a case that it should be `@slow`-ed since it provides good signal to developers.
+  3. Who relies on this test? Do they understand the implications of `@slow`?
+
+- Who will monitor `main` for PR-related breakages? What is the SLA on fixes? Recent additions such as [messaging Discord on `main` failures](https://github.com/tlc-pack/ci-monitoring) and keeping track of the last known good commit (apache/tvm#10056) should make this easier.
+
+# Future possibilities
+
+[future-possibilities]: #future-possibilities
+
+- Better communication in Jenkins job pages of which tests ran, which did not, and why
+- Different levels of tests. `main` is the most frequent step, but longer running tests could be moved out to nightly or even release level testing (though this makes debugging failures more difficult).

Review comment:
       One of the concerns raised by @areusch is the change in coverage by removing slow tests, it would be good to investigate the difference in coverage both in Python and C++ to help guide us in re-introducing tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org