Posted to dev@tvm.apache.org by tqchen via Apache TVM Discuss <no...@discuss.tvm.ai> on 2021/02/09 18:38:54 UTC

[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers


Thanks for the discussions so far. Everyone seems to agree on T1: it is good to have a place that users and developers can look into. T2 and T3 are the ones worth discussing.

Let me try to summarize a few rationales behind the current CI pipeline.

### R0: Use a Binary Tag for Reproducibility and Stabilization

During testing, the dependencies being used go beyond Python dependencies (what a poetry lock file can capture) and include things like the BLAS version, NNPACK code, and the CUDA runtime.

The **Docker binary image** is still the only reliable source of truth if we want to exactly reproduce the test runs. This is why we opted to use binary tags and to stage dependency images during test-dependency updates, so that developers can pick up these tags.
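
For illustration, "picking up a tag" roughly means pulling that exact image and re-running the tests inside it. The tag below is hypothetical; the authoritative tags are the ones referenced from the Jenkinsfile.

```bash
# Hypothetical tag for illustration; use the tag pinned in the Jenkinsfile.
docker pull tlcpack/ci-gpu:v0.64

# docker/bash.sh in the TVM repo drops you into the container with the
# source tree mounted, so the failing test can be re-run in the exact
# environment the CI used.
./docker/bash.sh tlcpack/ci-gpu:v0.64
```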

### R1: Manual Fixes are Unavoidable During Dependency Update

Changes to APIs such as TensorFlow and PyTorch will likely require updating the test cases, which means manual fixes are unavoidable. This factor also slightly favors building the docker images separately, so incremental changes can be made.

### R2: Simple Flow to Build the Docker Image

Different downstream users might prefer a different CI flow that builds the docker image directly from source. Having a simple flow, `docker/build.sh`, to do so is very useful in those cases.
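
A rough sketch of that flow (the container type name follows the `docker/Dockerfile.<type>` convention in the repo):

```bash
# Build the CPU CI image locally from the Dockerfile and install scripts in docker/.
./docker/build.sh ci_cpu

# The same script builds the other variants, e.g.:
./docker/build.sh ci_gpu
```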

### Discussions

The above factors might lead to the conclusion that I1 is likely the simplest approach to take. It is also the plan that is immediately actionable.

While I2/I3 indeed capture more Python dependencies via a lock, they do not necessarily capture all the dependencies needed (see R0). Additionally, I2/I3 might involve checking in a lock file (which cannot really be reviewed, since we do not immediately know the impact of the change) that is produced from another file in the repo, creating a cyclic-like dependency in the upgrade process (txt file to lock file).

The original constraints file (which can be approved by a human) and the binary tag (which can be verified by CI) generally satisfy the two needs (readability and final pinpointing), and they make R1 generally easier. Additionally, the set of Python dependencies can be generated through `pip freeze` in the `ci-gpu` image (or the `ci-cpu` image for CPU-only packages), which means we still meet our original requirement of tracking CI deps.
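
As a sketch, that generation step could be as simple as running `pip freeze` inside the published images (image tags are illustrative):

```bash
# Snapshot the Python packages actually installed in the GPU CI image.
docker run --rm tlcpack/ci-gpu:v0.64 python3 -m pip freeze > ci-gpu-python-deps.txt

# Same idea for the CPU-only packages.
docker run --rm tlcpack/ci-cpu:v0.62 python3 -m pip freeze > ci-cpu-python-deps.txt
```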





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/6) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers

Posted by Andrew Reusch via Apache TVM Discuss <no...@discuss.tvm.ai>.

One thing that's particularly annoying is when an error occurs in `ci-gpu`. I can't run `ci-gpu` locally because I don't have nvidia-docker or a compatible GPU, and people on e.g. Windows will be in exactly the same boat. It's sometimes possible to quickly run the failing code locally in e.g. `ci-cpu` and reasonably expect it to fail in the same way, but you may or may not reproduce the error in doing that. In that case, you may wonder whether a discrepancy in Python deps is responsible, but there's no good way to eliminate that variable.

I agree with your sentiment that strong consistency might not contribute to the overall health of the CI, but a lack of consistency can make problems harder to debug when you don't have access to the full spectrum of CI variants. I think we should be judicious about when we allow the CI to vary, and I'm not sure that e.g. allowing attrs to vary between containers buys us anything other than uncertainty.

there are a couple of other problems that arise from a lack of standardized packages:
- If you want to replicate the CI on some other Linux system: which container's `pip freeze` should you use? Right now, you have to be a CI expert to know the answer. As a newcomer to the project it would be hard to decipher which container tests which functionality, so you would probably just guess and hope you picked right.
- If we ever start relying on dev packages from `ci-gpu` that are also present in `ci-lint` (e.g. rerunning pylint for some reason), then discrepancies in their precise versions could cause late-stage CI failures.
- It makes it difficult to rebuild the CI containers when you want to change something other than the Python packages, since you can't avoid signing up to update the packages as well.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/13) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers

Posted by tqchen via Apache TVM Discuss <no...@discuss.tvm.ai>.

Yes, I agree with you that different containers could end up with different packages for things that are not pinned (thus my comment about the need for strong consistency among CI variants).

My last post is mainly about how big an issue that can actually cause, since we can pin down most direct deps, and we also produce binary packages for different platforms independently (from the corresponding CI container).





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/12) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers

Posted by Andrew Reusch via Apache TVM Discuss <no...@discuss.tvm.ai>.

Here's what I don't like about I1:
 - it doesn't ensure dependency versions are compatible across all the various extras in each container architecture
 - it still means that `ci-lint` or `ci-gpu` could wind up with different package versions from `ci-cpu`

I do agree that most of the large direct dependencies are quite pinned, and we could probably accomplish a lot with a `ci-constraints.txt` that just explicitly pins those direct dependencies.
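
A minimal sketch of what I have in mind, with illustrative file contents, versions, and paths (not the actual CI pins), consumed via pip's `-c` constraints mechanism:

```bash
# ci-constraints.txt: explicit pins for TVM's large direct dependencies only.
cat > ci-constraints.txt <<'EOF'
tensorflow==2.1.0
torch==1.7.1
onnx==1.6.0
EOF

# During container build, install the usual requirements, constrained to the
# pinned versions above (the requirements path shown is illustrative).
python3 -m pip install -r python/requirements.txt -c ci-constraints.txt
```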





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/11) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers

Posted by tqchen via Apache TVM Discuss <no...@discuss.tvm.ai>.

One potential idea is to start with I1, which is simpler, and then see if there is a need to move to I4. Personally, I don't think we really need stronger consistency of packages across the CI spectrum; sometimes we might even want to diversify it a bit (e.g. use different compilers such as clang/gcc). We also produce different tlcpack binary packages for different systems, so we could derive the CPU package from `ci-cpu` and the GPU package from `ci-gpu`.

Additionally, most of the relevant dependencies should already be captured by the constraints file (e.g. `tensorflow==1.6`), so there is not a lot of room to pin things down further.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/10) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Python Dependencies in TVM CI Containers

Posted by Andrew Reusch via Apache TVM Discuss <no...@discuss.tvm.ai>.

@tqchen Thanks for the reply! A couple of points:

- You're right that I2/I3 don't capture all dependencies. There is a notable gap between a set of Python dependencies and the docker container: the remaining libraries which may be present in the container.
- The method by which those libraries are installed is platform-dependent, but so far our approach for the CI has been mostly to depend on a package manager to satisfy them.
- I think that even if a lockfile can't be used primarily to reproduce the test environment, there are plenty of other uses--here are a couple:
    - Determining which version of a C library is suitable when trying to install TVM on a system other than the docker containers. You might prefer to at least know the version of the Python package which uses it. I had trouble finding a non-dev dependency like this but I think this pattern is very common, and [Pillow](https://pillow.readthedocs.io/en/latest/installation.html#external-libraries) is an example of such a package.
- Oftentimes a user isn't trying to exactly reproduce the regression--they just want to know which minor version of a package to install, e.g. tensorflow 2.1 vs 2.2. This can make a big difference, especially when the main problem is API incompatibility. Even if a user can't 100% reproduce the regression, it's often the case that if they're using TVM in an application, they won't be able to use the TVM CI containers as a starting point.

### Updating Docker Containers
I think it's worth thinking through the case of updating the docker containers a little more because I'm not sure I follow your reasoning here.

The current process is as follows, per container:
1. Update `docker/install` and the Dockerfile as needed to modify the container.
2. Run `docker/build.sh` to build the new container image.
3. Fix problems and rerun steps 1 & 2 as necessary.
4. Test the containers -- discussed later.

The current build process is quite straightforward--it's just building a docker container. I2 and I3 propose to add these changes:
1. prior to building the containers, create a lockfile
2. rebuild all containers when the lockfile changes, rather than just updating one

Change 2 shouldn't significantly burden developers unless the build process is error-prone. We should fix that if it is, and perhaps a nightly container build isn't a bad Jenkins job to configure so we can address these failures as they happen.

Change 1 is the one that needs to be thought through. Ordinarily, building a lockfile means running a single command on the target platform, and often this process takes about a minute. This would not be a significant burden. The tricky part is that we have multiple platforms in the CI (three, in fact: manylinux1_i686, manylinux1_x86_64, and manylinux_arm), and it's not possible to build the lockfile for a platform other than the one you're running on.

In thinking this through again, I realized that this is actually the same problem we have with Windows: it's not possible to use a single lockfile for all platforms. If you are switching platforms, you have to accept that you may need to switch dependency versions.
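
To make that concrete, a pip-based sketch of "build the lockfile" for whichever platform you happen to be on (paths and file names are illustrative):

```bash
# Resolve and install TVM's requirements, then freeze the result. The
# resolution reflects only the platform this runs on, which is why each
# pip platform (manylinux1_i686, manylinux1_x86_64, manylinux_arm, ...)
# needs its own lockfile.
python3 -m pip install -r python/requirements.txt
python3 -m pip freeze > lockfile-$(uname -m).txt
```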

So let's return to the ultimate goals of the I2/I3 approach:
1. Ensure all features of TVM can actually be used simultaneously on a given platform.
2. Standardize dependencies so that the unittests don't fail for different reasons on different containers.
3. Provide enough input to allow the `tlcpack` build process to place narrower constraints on the `install_requires` included there, based on the versions used in CI.

In light of the above, all of these goals can be achieved only within a single pip platform. So I'd propose a new way to install dependencies that addresses this, I4:

I4. A multi-part approach:
1. For each pip platform that has a ci- container built around it: produce a lockfile either by:
    1. `pip install -r` all TVM requirements and `pip freeze`
    2. `poetry lock`

    The lockfile becomes part of the artifacts created when the containers are built. When multiple containers are built for the same pip platform (e.g. ci-lint, ci-cpu, ci-gpu, ci-qemu, ci-wasm), pick one container to build the lockfile. The rest merely consume this lockfile when installing packages, either as a pip constraints file or as a poetry lockfile.

2. I would argue that between platforms, the version standardization we care about is effectively TVM's direct Python dependencies at the minor-version level. For instance, if we add Windows tests, we would prefer to test against e.g. tensorflow 1.6 on both Linux and Windows. To address this, when building containers we would first build a pilot platform, e.g. `manylinux1_x86_64`, scan the resulting lockfile for TVM's direct dependencies, and produce a loose constraints file for the remaining platforms (a rough sketch of this derivation follows at the end of this list). For example, if the lockfile contained an entry `tensorflow==1.6.8`, we might add `tensorflow>=1.6,<1.7` to the cross-platform constraints file.

   -> This may not always be possible (conda may not update fast enough for us, etc.), so we will need an override file per platform. This is complex, but it's likely happening anyway in the release process today, and the reasons why different packages are being installed aren't documented anywhere.
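
A rough sketch of the pin-loosening step described above, assuming a freeze-style lockfile of `name==version` lines from the pilot platform (file names are illustrative, and it assumes at least major.minor version components):

```bash
# Turn exact pins into loose minor-version ranges for the other platforms,
# e.g. "tensorflow==1.6.8" -> "tensorflow>=1.6,<1.7".
while IFS= read -r line; do
  case "$line" in
    *==*)
      pkg="${line%%==*}"
      ver="${line#*==}"
      major="${ver%%.*}"
      rest="${ver#*.}"
      minor="${rest%%.*}"
      echo "${pkg}>=${major}.${minor},<${major}.$((minor + 1))"
      ;;
  esac
done < manylinux1_x86_64-lockfile.txt > cross-platform-constraints.txt
```

In a real implementation we would also filter this down to TVM's direct dependencies rather than loosening everything in the freeze.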

### Checking in new containers

To check in the built containers, these things need to happen:
1. upload candidate containers to docker hub under your username
2. prepare a TVM PR which updates the Jenkinsfile to reference these test containers
3. have a TVM committer push the PR to [ci-docker-staging](https://github.com/apache/tvm/tree/ci-docker-staging).
4. await test results and iterate as needed.
5. push candidate containers to `tlcpack` docker hub account
6. amend the TVM PR to point at `tlcpack` images and submit it.

Here we're proposing that in step 2, we also include the per-platform lockfiles produced when building the containers.

### `tlcpack` Dependencies

Finally: I4 proposes we produce a cross-platform constraints file. This isn't discussed as part of this RFC (it's the next one), but it would be nice to produce a `tlcpack` wheel which includes some loose version constraints (e.g. so that by depending on tensorflow, a user running `pip install tlcpack` would install tensorflow 1.x and not 2.x). The version constraints in the cross-platform constraints file are exactly those we should use in such a `tlcpack` wheel. 

Previously, the follow-on RFC had proposed to analyze the CI lockfile and produce the `tlcpack` dependencies. I4 just proposes to move that analysis earlier, and effectively means that the CI would be testing against the same constraints present in the `tlcpack` wheel. This seems like a better change to me.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-python-dependencies-in-tvm-ci-containers/9011/7) to respond.
