Posted to commits@mxnet.apache.org by "DickJC123 (via GitHub)" <gi...@apache.org> on 2023/02/17 22:23:58 UTC

[GitHub] [mxnet] DickJC123 opened a new pull request, #21182: [BUGFIX] Fix segfault at training exit

DickJC123 opened a new pull request, #21182:
URL: https://github.com/apache/mxnet/pull/21182

   ## Description ##
   This provides an updated fix for issue https://github.com/apache/mxnet/issues/19379.  To understand this fix, a bit of history is needed:
   
   - At some point (possibly with the arrival of Ubuntu 20.04), the destruction order at program exit of the CUDA context and MXNet's singleton engine became non-deterministic.  When the engine was destroyed after the CUDA context, a segfault would occur because CUDA resources were released to a no-longer-existent context (see the sketch after this list).
   - @ptrendx supplied the fix of not destroying MXNet's Stream objects at exit in master PR https://github.com/apache/mxnet/pull/19378, which was also back-ported to v1.x.
   - Since then, improvements to CUDA (possibly starting with CUDA 11.2) have made newer versions no longer susceptible to the problem; CUDA 10.2 and 11.0 are confirmed to be susceptible.
   - A different issue, https://github.com/apache/mxnet/issues/20959, was found to be due to the lack of CUDA resource cleanup at exit.  As a result, @ptrendx's PR was reverted (on the v1.9.x branch) in https://github.com/apache/mxnet/pull/20998.  Due to the improvements in CUDA, most users did not experience the return of the original segfault-at-exit problem.
   - But clearly, for users still on CUDA 10.2 and 11.0, the segfault behavior has returned (see recent posts to issue https://github.com/apache/mxnet/issues/19379).
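   
   To make the failure mode concrete, the standalone sketch below (illustrative only, not MXNet code) releases a stream handle after its CUDA context has been torn down.  On recent CUDA versions this surfaces as an error code rather than a crash, consistent with the hardening noted above:
   
   ```cpp
   // Illustrative repro of the destruction-order hazard; requires a GPU.
   #include <cstdio>
   #include <cuda_runtime.h>
   
   int main() {
     cudaStream_t stream;
     if (cudaStreamCreate(&stream) != cudaSuccess) {
       printf("no usable GPU; skipping\n");
       return 0;
     }
   
     // Tear down the primary context, mimicking the CUDA context being
     // destroyed at process exit before MXNet's singleton engine.
     cudaDeviceReset();
   
     // The handle now refers to a destroyed context.  Modern CUDA reports
     // an error here; on CUDA 10.2/11.0 the analogous exit-time release
     // could segfault instead.
     cudaError_t err = cudaStreamDestroy(stream);
     printf("cudaStreamDestroy after context teardown: %s\n",
            cudaGetErrorString(err));
     return 0;
   }
   ```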
   
   This PR resupplies the fix of not destroying MXNet's Stream objects, but applies the remedy only when the main Python process is exiting (as detected by `shutdown_phase_ == true`). The destruction of Streams in the dataloader side-processes is not affected, so the data memory leak should not resurface. A sketch of the guard appears below.
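   
   For illustration, here is a minimal sketch of that kind of guard.  The class and member names (`StreamManager`, `streams_`) are assumptions for the example, not MXNet's actual source; only the `shutdown_phase_` flag comes from the PR description:
   
   ```cpp
   // Hedged sketch of the shutdown guard; StreamManager and streams_ are
   // hypothetical names, not MXNet's actual types.
   #include <cuda_runtime.h>
   #include <vector>
   
   class StreamManager {
    public:
     ~StreamManager() {
       if (shutdown_phase_) {
         // The main Python process is exiting: the CUDA context may already
         // be gone, so deliberately leak the streams rather than touch it.
         // The OS reclaims all resources at process exit anyway.
         return;
       }
       // Normal teardown (e.g. a dataloader worker process exiting): destroy
       // the streams so GPU resources are released and do not leak.
       for (cudaStream_t s : streams_) {
         cudaStreamDestroy(s);
       }
     }
   
     bool shutdown_phase_ = false;  // set when library shutdown is detected
     std::vector<cudaStream_t> streams_;
   };
   ```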
   
   ## Checklist ##
   ### Essentials ###
   - [X] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
   - [X] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage
   - [X] Code is well-documented
   
   




[GitHub] [mxnet] josephevans merged pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "josephevans (via GitHub)" <gi...@apache.org>.
josephevans merged PR #21182:
URL: https://github.com/apache/mxnet/pull/21182




[GitHub] [mxnet] chinakook commented on pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "chinakook (via GitHub)" <gi...@apache.org>.
chinakook commented on PR #21182:
URL: https://github.com/apache/mxnet/pull/21182#issuecomment-1447761717

   LGTM




[GitHub] [mxnet] DickJC123 commented on pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "DickJC123 (via GitHub)" <gi...@apache.org>.
DickJC123 commented on PR #21182:
URL: https://github.com/apache/mxnet/pull/21182#issuecomment-1441007667

   > Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959
   
   To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'?  Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)?  Do we have multi-GPU testing in CI?




[GitHub] [mxnet] mxnet-bot commented on pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "mxnet-bot (via GitHub)" <gi...@apache.org>.
mxnet-bot commented on PR #21182:
URL: https://github.com/apache/mxnet/pull/21182#issuecomment-1435344737

   Hey @DickJC123, thanks for submitting the PR.
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:
   - To trigger all jobs: @mxnet-bot run ci [all]
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2]
   ***
   **CI supported jobs**: [unix-gpu, windows-cpu, website, miscellaneous, centos-gpu, windows-gpu, edge, clang, sanity, centos-cpu, unix-cpu]
   ***
   _Note_:
   Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
   All CI tests must pass before the PR can be merged.
   




[GitHub] [mxnet] waytrue17 commented on pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "waytrue17 (via GitHub)" <gi...@apache.org>.
waytrue17 commented on PR #21182:
URL: https://github.com/apache/mxnet/pull/21182#issuecomment-1441039832

   > > Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959
   > 
   > To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?
   
   It seems the CI installs horovod only on master ([here](https://github.com/apache/mxnet/blob/48d7f4af70015a559b20bc58aea037548c7e2658/ci/docker/runtime_functions.sh#L1061)), and it was disabled due to a [hanging issue](https://github.com/apache/mxnet/blob/fc54fabe61573d0405b6ddead8c1501aedad3a11/ci/jenkins/Jenkinsfile_unix_gpu#L53).
   Approved the PR since we have verified the fix with a local test. Let's bring it in and unblock the customers.




[GitHub] [mxnet] DickJC123 commented on pull request #21182: [BUGFIX] Fix segfault at training exit

Posted by "DickJC123 (via GitHub)" <gi...@apache.org>.
DickJC123 commented on PR #21182:
URL: https://github.com/apache/mxnet/pull/21182#issuecomment-1440976684

   I went back to the repro Python program given in https://github.com/apache/mxnet/issues/20959 and verified that this PR does not inadvertently reintroduce the memory leak.  Output produced:
   ```
   num_workers: 0 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 0 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 0 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 0 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 0 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 4 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 4 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 4 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 4 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 4 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 8 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 8 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 8 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 8 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   num_workers: 8 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org