Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/28 16:05:03 UTC

[GitHub] [incubator-mxnet] ChaiBapchya opened a new pull request #18186: Update unix gpu toolchain

ChaiBapchya opened a new pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186


   ## Description ##
   Currently, the Unix GPU and CentOS GPU tests run on P3 and G3 AWS EC2 instances.
   To improve cost and efficiency, a switch to G4 EC2 instances has been proposed.
   
   This switch involves a broad upgrade of the GPU toolchain:

   | Component | Old | New |
   |-------------------|---------|-----------|
   | Ubuntu LTS | 16.04.3 | 18.04.3 |
   | Tesla GPU | M60 | T4 |
   | EC2 Instance Type | G3 | G4 |
   | Docker | 18.09 | 19.03 |
   | NVIDIA Driver | 418.56 | 440.33.01 |
   | CUDA Driver | 10.1 | 10.2 |
   
   ### Code Changes ###
   1. Docker 19.03 has built-in GPU support, so `nvidia-docker` is replaced with `docker --gpus all`.
   2. Since the host machine has updated drivers, TVM Op shouldn't need CUDA compat [`/usr/local/cuda/compat`].
   3. Replace `ubuntu_gpu_cu101` with `ubuntu_build_cuda`.
   Docker Compose follows a multi-stage build [https://docs.docker.com/develop/develop-images/multistage-build/] and defines multiple targets:
   the `ubuntu_build_cuda` target is `gpuwithcudaruntimelibs`;
   the `ubuntu_gpu_cu101` target is `gpuwithcompatenv` [which has now been commented out].
   
   4. This was tested on the CI Dev account: http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation-bapac%2Funix-gpu/detail/update_gpu_toolchain/8/pipeline
   The TVMOpError related to binary ops was encountered: https://github.com/apache/incubator-mxnet/issues/17840
   To unblock the migration from G3 to G4, these flaky tests have been skipped.
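
To illustrate the first code change, here is a minimal Python sketch of choosing between the `nvidia-docker` wrapper and `docker --gpus all` based on the Docker version. The helper names (`parse_docker_version`, `gpu_run_command`) are hypothetical and are not part of the MXNet CI scripts; only the `--gpus all` flag and the 19.03 cutoff come from the description above.

```python
from typing import List, Tuple


def parse_docker_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '19.03.8'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def gpu_run_command(docker_version: str, image: str, command: List[str]) -> List[str]:
    """Build an argv list that runs `command` in `image` with all GPUs visible."""
    if parse_docker_version(docker_version) >= (19, 3):
        # Docker 19.03 and later accept the --gpus flag directly.
        prefix = ["docker", "run", "--gpus", "all"]
    else:
        # Older Docker releases go through the nvidia-docker wrapper.
        prefix = ["nvidia-docker", "run"]
    return prefix + ["--rm", image] + command
```

For example, `gpu_run_command("19.03.8", "some-image", ["nvidia-smi"])` would yield a `docker run --gpus all --rm some-image nvidia-smi` invocation, while an 18.09 host would fall back to `nvidia-docker run`.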
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage:
   - [ ] Code is well-documented: 
   - [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ## Comments ##
   Thanks to @ptrendx for the help identifying libcuda compat as the root cause of
   ```
   CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination
   ```
   This helped me close: https://github.com/NVIDIA/nvidia-docker/issues/1252
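
As a rough sketch of the idea behind dropping the compat libraries: the snippet below removes the compat directory from an `LD_LIBRARY_PATH`-style string. That the compat libraries are picked up through `LD_LIBRARY_PATH` is an assumption made for illustration, and `without_cuda_compat` is a hypothetical helper; the PR itself changes the Docker build target rather than running code like this.

```python
import posixpath
from typing import Optional

COMPAT_DIR = "/usr/local/cuda/compat"  # path quoted in the PR description


def without_cuda_compat(ld_library_path: Optional[str]) -> str:
    """Drop the CUDA compat directory from a colon-separated library path."""
    if not ld_library_path:
        return ""
    kept = [
        entry
        for entry in ld_library_path.split(":")
        # normpath removes a trailing slash so both spellings match.
        if entry and posixpath.normpath(entry) != COMPAT_DIR
    ]
    return ":".join(kept)
```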
   
   Thanks to @leezu and @josephevans for their help throughout this migration effort, and to @sandeep-krishnamurthy and @szha for the guidance.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

mxnet-bot commented on pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#issuecomment-626287381


   Jenkins CI successfully triggered : [windows-gpu]



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

mxnet-bot commented on pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#issuecomment-620700667


   Hey @ChaiBapchya , Thanks for submitting the PR 
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [unix-gpu, clang, website, sanity, edge, centos-gpu, windows-cpu, unix-cpu, windows-gpu, centos-cpu, miscellaneous]
   *** 
   _Note_: 
    Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   



[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416962006



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       Wait why not get rid of ubuntu_cu101 as well? 





[GitHub] [incubator-mxnet] ChaiBapchya commented on pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#issuecomment-626287370


   @mxnet-bot run ci [windows-gpu] 
   
   Assertion failed for `test_np_mixed_precision_binary_funcs`; likely flaky.



[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416992378



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       Oops didn't do it cleanly. Fixing it now. https://github.com/apache/incubator-mxnet/pull/18186/commits/ec5330d7c86e770687053480be585a75712ce2a4





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r418358081



##########
File path: tests/python/unittest/test_numpy_op.py
##########
@@ -2220,6 +2222,7 @@ def hybrid_forward(self, F, x):
                 assert same(ret_mx.asnumpy(), ret_np)
 
 
+@unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/17840")

Review comment:
       TVMOP is always turned off for GPU builds on CI now. You can remove these skips.
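
For context, the diff above uses an unconditional `unittest.skip`. A narrower alternative, sketched here under assumptions, is `unittest.skipIf`, which skips only when a condition holds. The `tvm_op_enabled` helper and the `EXAMPLE_TVM_OP_ENABLED` variable are hypothetical and are not an MXNet API; the test body is a placeholder.

```python
import os
import unittest


def tvm_op_enabled() -> bool:
    """Hypothetical check based on an illustrative environment variable."""
    return os.environ.get("EXAMPLE_TVM_OP_ENABLED", "0") == "1"


class BinaryOpTest(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipIf(tvm_op_enabled(), "Flaky with TVM Op, see issue 17840")
    def test_add(self):
        self.assertEqual(1 + 2, 3)
```

With TVM Op disabled for all GPU builds, as the comment notes, neither form of skip is needed and the decorators can simply be removed.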





[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416955544



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       So the aim is to get rid of both ubuntu_build_cuda and ubuntu_cu101 and instead use `ubuntu_cpu` [base image]
   Correct? @leezu 





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416961539



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       `ubuntu_cu101`, `ubuntu_cu102`, etc. all make use of that `gpu` target, but have different base images.





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416964658



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       You still need some GPU image. We may later find (once CUDA 11 is released) that the `gpu` target is not needed anymore, and then we can point `ubuntu_gpu_cu101` to the `base` target.
   
   `ubuntu_gpu_cu101` would still have a different base image compared to `ubuntu_cpu` (`nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04` vs `ubuntu:18.04`).
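
The relationship between image names, base images, and multi-stage targets can be sketched as a small lookup table. This is only an illustration: the real configuration lives in the project's Docker Compose setup, the `BASE_IMAGE` build argument name is an assumption, and mapping `ubuntu_cpu` to the `base` target is inferred from the discussion rather than confirmed. The base images are the ones quoted in the comment above.

```python
from typing import Dict, List, NamedTuple


class ImageSpec(NamedTuple):
    base_image: str  # image the Dockerfile starts FROM
    target: str      # multi-stage build target to stop at


# Assumed mapping for illustration; not copied from the project's config.
IMAGES: Dict[str, ImageSpec] = {
    "ubuntu_cpu": ImageSpec("ubuntu:18.04", "base"),
    "ubuntu_gpu_cu101": ImageSpec(
        "nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04", "gpu"
    ),
}


def build_command(name: str, dockerfile: str = "Dockerfile.build.ubuntu") -> List[str]:
    """Return a `docker build` argv list for the named image."""
    spec = IMAGES[name]
    return [
        "docker", "build",
        "--file", dockerfile,
        "--target", spec.target,
        # BASE_IMAGE is a hypothetical build-arg name.
        "--build-arg", f"BASE_IMAGE={spec.base_image}",
        "--tag", name,
        ".",
    ]
```

Pointing `ubuntu_gpu_cu101` at the `base` target later would then be a one-field change, while the base image would still differ from `ubuntu_cpu`.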





[GitHub] [incubator-mxnet] ChaiBapchya commented on pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#issuecomment-620887845


   Rebased to pick up the windows-gpu fix from https://github.com/apache/incubator-mxnet/pull/18177



[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416963987



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       We should get rid of the `gpuwithcompatenv` target and point `ubuntu_cu101` to the `gpu` target.





[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r421651875



##########
File path: tests/python/unittest/test_numpy_op.py
##########
@@ -2220,6 +2222,7 @@ def hybrid_forward(self, F, x):
                 assert same(ret_mx.asnumpy(), ret_np)
 
 
+@unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/17840")

Review comment:
       Alright, removing all the flaky test skips then





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416816441



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       Why this change? Note that `ubuntu_build_cuda` is an image that is only provided to work around some broken design problems:
   
   https://github.com/apache/incubator-mxnet/blob/76fa58373636c57fee1e4e6cd7960723b39f455f/ci/docker/Dockerfile.build.ubuntu#L153-L162
   
       We shouldn't migrate more tests to this image, but rather get rid of it.





[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416973134



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       Added as recommended
   https://github.com/apache/incubator-mxnet/pull/18186/commits/022e135585cd830d2095e6611e13afe6955d21ce
   Take a look @leezu 





[GitHub] [incubator-mxnet] ChaiBapchya commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

ChaiBapchya commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416963361



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       We don't want to use libcuda compat, right? But the target of `ubuntu_cu101` is `gpuwithcompatenv`.





[GitHub] [incubator-mxnet] leezu commented on a change in pull request #18186: Update unix gpu toolchain

Posted by GitBox <gi...@apache.org>.

leezu commented on a change in pull request #18186:
URL: https://github.com/apache/incubator-mxnet/pull/18186#discussion_r416961350



##########
File path: ci/jenkins/Jenkins_steps.groovy
##########
@@ -155,7 +155,7 @@ def compile_unix_int64_gpu() {
         ws('workspace/build-gpu-int64') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_gpu_cu101', 'build_ubuntu_gpu_large_tensor', false)
+            utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_large_tensor', false)

Review comment:
       Only `ubuntu_build_cuda`, as we still need
   
   https://github.com/apache/incubator-mxnet/blob/76fa58373636c57fee1e4e6cd7960723b39f455f/ci/docker/Dockerfile.build.ubuntu#L143-L150



