
Posted to dev@tvm.apache.org by "Leandro Nunes (Arm) via Apache TVM Discuss" <no...@discuss.tvm.ai> on 2021/06/29 17:04:48 UTC

[Apache TVM Discuss] [Development/pre-RFC] Automated way to health check TVM Dockerfiles

Currently we don't do any automated testing to make sure our Docker images are healthy, so it is not uncommon for the images to be broken without us having any visibility of the issues. We only find out when we decide to update the images, and by then it causes massive pain (e.g. https://github.com/apache/tvm/issues/8177).

Rebuilding our Docker images takes at least a couple of hours on its own, so it is currently impractical to rebuild them for every PR or merge.

To give visibility of issues in our Dockerfiles, I'd like to propose an automated build that uses our existing infrastructure to regenerate the images from scratch once a day, so that problems can be spotted early without further increasing the time it takes to validate our PRs. The work needed to maintain the images can then also be spread across the community.

This proposal can be implemented by two independent Jenkins pipelines. Here is a summary of what they would do:

1. P1: **daily-docker-images-rebuild:** fetches the latest Dockerfile definitions from the TVM repository and rebuilds the images from scratch. If successful, uploads the images to a "tlcpack-staging" (provisional name) DockerHub account.
2. P2: **daily-docker-image-validate:** pulls the latest images from "tlcpack-staging" (provisional name) and runs our existing tests on them, with the latest TVM sources.

As mentioned, the pipelines are independent, and **they are not expected to make any changes to our production CI** (the images used to run our CI from GitHub Pull requests), without manual intervention.

Going a bit more in detail on what each pipeline would accomplish:

daily-docker-images-rebuild
---
* This pipeline is triggered by a timer, running once a day
* Fetches the latest TVM sources
* Uses `docker/build.sh` to rebuild all images currently used in CI: `ci_lint`, `ci_cpu`, `ci_gpu`, `ci_arm`, `ci_i386`, `ci_wasm` and `ci_qemu`
* Tags each image with two tags: `latest` and a timestamp-based tag `YYYY-MM-DD-HH-MM-SS-<short_latest_tvm_git_hash>`
* If successful, uploads the images to an account on DockerHub
* In all cases (success or failure), sends notifications somewhere visible to the community, e.g. *Discord channel or mailing lists*
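
To make the tagging scheme above concrete, here is a minimal Python sketch of how the two tags could be derived. It is only an illustration, not the draft job itself: the function names `image_tags` and `image_refs`, the 7-character short hash, and the mapping from `ci_lint` to a DockerHub repository called `ci-lint` are all assumptions made for this example.

```python
from datetime import datetime
from typing import List


def image_tags(build_time: datetime, git_hash: str, short_len: int = 7) -> List[str]:
    """Return the two proposed tags: `latest` and a timestamp plus short git hash."""
    stamp = build_time.strftime("%Y-%m-%d-%H-%M-%S")
    # Assumption: the "short" hash is the first `short_len` characters.
    return ["latest", f"{stamp}-{git_hash[:short_len]}"]


def image_refs(account: str, image: str, tags: List[str]) -> List[str]:
    """Build full image references for one image under a DockerHub account."""
    # Assumption: `ci_lint` is published as `ci-lint` on DockerHub.
    repo = image.replace("_", "-")
    return [f"{account}/{repo}:{tag}" for tag in tags]


if __name__ == "__main__":
    # The hash below is a placeholder, not a real TVM commit.
    tags = image_tags(datetime(2021, 6, 29, 17, 4, 48), "0123456789abcdef")
    for ref in image_refs("tlcpack-staging", "ci_lint", tags):
        print(ref)
```

Keeping the timestamp and the hash in the tag means a failing image can be traced back to the exact TVM revision it was built from, while `latest` gives the validation pipeline a stable name to pull.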

daily-docker-image-validate
---

* This pipeline is triggered by a successful "daily-docker-images-rebuild"
* Fetches the latest TVM sources and runs the existing `tvm/Jenkinsfile`, pointing at the images generated by the pipeline above
* In all cases (success or failure), sends notifications somewhere visible to the community, e.g. *Discord channel or mailing lists*
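
One possible way to "point" the existing Jenkinsfile at the staging images is to rewrite its image declarations before the run. The sketch below assumes the Jenkinsfile declares each image on its own line in the form `ci_cpu = "account/ci-cpu:tag"`; if the real file is laid out differently, the pattern would need adjusting. The helper name `point_at_staging` is hypothetical.

```python
import re

# Assumption: image declarations look like  ci_cpu = "tlcpack/ci-cpu:<tag>"
IMAGE_LINE = re.compile(
    r'^(?P<var>ci_\w+)\s*=\s*"(?P<account>[^/"]+)/(?P<name>[^:"]+):(?P<tag>[^"]+)"',
    re.MULTILINE,
)


def point_at_staging(jenkinsfile_text: str,
                     account: str = "tlcpack-staging",
                     tag: str = "latest") -> str:
    """Return the Jenkinsfile text with every ci_* image moved to `account:tag`."""
    def repl(match):
        return f'{match.group("var")} = "{account}/{match.group("name")}:{tag}"'

    # Lines that do not match the pattern are left untouched.
    return IMAGE_LINE.sub(repl, jenkinsfile_text)


if __name__ == "__main__":
    # The version tags here are made up for the example.
    sample = 'ci_lint = "tlcpack/ci-lint:v0.01"\nci_cpu = "tlcpack/ci-cpu:v0.02"\n'
    print(point_at_staging(sample))
```

Because the rewrite happens on a copy of the file inside the validation job, the production CI definition is never modified, which matches the requirement that these pipelines make no changes to production without manual intervention.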

Next steps
---

I have a draft job that implements "daily-docker-images-rebuild", and I'll be posting that in the next few days. In the meantime, I'd like to ask for feedback and ideas on how to deal with the issues described here.

cc @areusch @tqchen @ramana-arm @haichen @jroesch @thierry @Lunderberg @mbrookhart



[Apache TVM Discuss] [Development/pre-RFC] Automated way to health check TVM Dockerfiles

Posted by Andrew Reusch via Apache TVM Discuss <no...@discuss.tvm.ai>.

Thanks @leandron for the proposal! I agree this will be a great help in monitoring the container rebuild process for problems and should reduce the headache typically involved with updating containers.

I think we could prototype this first using https://discuss.tvm.apache.org/t/ci-how-to-run-your-own-tlcpack-ci-and-proposing-future-improvements-to-https-ci-tlcpack-ai/10123 so we can iterate without impacting CI runtime, and then migrate it to the production TVM CI to run at night, when CI load is lower.

Below I scope out a couple ideas for future work which may help to motivate this project.

#### Future work: use autobuilt containers for production TVM CI

I think it would be interesting to implement this and then consider only allowing containers built by this process to be promoted to official `tlcpack/ci-*` containers. We would likely need some additional work on top of this to provide a flexible enough interface (e.g. build selected containers on-demand, likely gated to committers) to support this workflow. However, the benefit is that all containers would then be built from a known clean revision of TVM, so a reproducible build is more likely to occur.

To be sure, this approach doesn't provide 100% reproducibility (the container build process contains a bunch of external dependencies, e.g. apt packages, LLVM, etc.), but it ensures those dependencies are documented and provides us a path to collaborate on future movement in that direction, should we so desire.

#### Future work: Build status dashboard

I think it would be great to also consider creating a concise status dashboard that shows a matrix of the build outcomes by container and date. This would make it easy to diagnose failures and bisect the range of PRs which may be suspect.
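
As a rough sketch of what such a matrix could look like, the Python below renders build outcomes as a plain-text grid with one row per container and one column per date. The input format (tuples of date, container, pass/fail) and the function name `status_matrix` are assumptions for illustration; a real dashboard would presumably read results from Jenkins.

```python
def status_matrix(results):
    """Render (date, container, passed) records as a container-by-date text grid."""
    results = list(results)
    outcome = {(date, container): passed for date, container, passed in results}
    dates = sorted({date for date, _, _ in results})
    containers = sorted({container for _, container, _ in results})
    width = max((len(c) for c in containers), default=0)

    lines = [" " * width + "  " + "  ".join(dates)]
    for container in containers:
        cells = []
        for date in dates:
            passed = outcome.get((date, container))
            # "?" marks a day with no recorded build for this container.
            mark = "?" if passed is None else ("ok" if passed else "FAIL")
            cells.append(mark.ljust(len(date)))
        lines.append(container.ljust(width) + "  " + "  ".join(cells))
    return "\n".join(line.rstrip() for line in lines)


if __name__ == "__main__":
    # Illustrative records only, not real build results.
    print(status_matrix([
        ("2021-06-28", "ci_cpu", True),
        ("2021-06-29", "ci_cpu", False),
        ("2021-06-28", "ci_gpu", True),
    ]))
```

Reading along a row, the first `FAIL` after an `ok` gives the pair of dates whose commit range is suspect, which is the bisection use case mentioned above.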

#### Future work: TVM Python dependencies

https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011 proposed some efforts to capture the set of Python deps used in the CI and improve their consistency. With this process in place, we should be able to finally build the constraints list of x86_64 dependencies. This would allow us to ensure that Python packages in ci-cpu, ci-gpu, and ci-lint match. This has been a point of confusion for me when debugging CI failures in the past.
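
A check along these lines could run as part of the daily build. The sketch assumes we can capture `pip freeze` output from each container (how that is collected is left out) and reports any package whose pinned version differs between containers. The function names and the versions in the example are hypothetical.

```python
def parse_freeze(text):
    """Parse `pip freeze` style lines (name==version) into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with "==".
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def version_mismatches(freezes):
    """Given {container: freeze text}, return {package: {container: version}}
    for packages installed at different versions in different containers."""
    parsed = {container: parse_freeze(text) for container, text in freezes.items()}
    names = set().union(*(pkgs.keys() for pkgs in parsed.values()))
    mismatches = {}
    for name in sorted(names):
        versions = {c: pkgs[name] for c, pkgs in parsed.items() if name in pkgs}
        if len(set(versions.values())) > 1:
            mismatches[name] = versions
    return mismatches


if __name__ == "__main__":
    # Made-up versions, for illustration only.
    print(version_mismatches({
        "ci-cpu": "numpy==1.19.0\nattrs==21.2.0\n",
        "ci-gpu": "numpy==1.20.0\nattrs==21.2.0\n",
        "ci-lint": "attrs==21.2.0\n",
    }))
```

A package that is missing from one container is not treated as a mismatch here, since ci-lint is expected to install fewer packages than ci-cpu or ci-gpu; only conflicting versions are reported.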

