You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/04/17 11:10:42 UTC
[GitHub] marcoabreu opened a new issue #10583: Slave hanging as part of
KVStore test
marcoabreu opened a new issue #10583: Slave hanging as part of KVStore test
URL: https://github.com/apache/incubator-mxnet/issues/10583
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10566/3/pipeline
Logged into the slave and was able to reproduce hang. I tried to recompile from scratch and use other branches, but the run always outputs an error like the following:
```
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/work/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/local.py", line 44, in exec_cmd
raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11
```
I have verified that no processes are allocating the GPU before the launch using ```nvidia-smi```. The process never exits and it continues to hog the GPU at about 20% usage and 3.9Gb of GPU RAM while being in an infinite loop.
```
+-----------------------------------------------------------------------------+
│| NVIDIA-SMI 390.25 Driver Version: 390.25 |
│|-------------------------------+----------------------+----------------------+
│| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
│| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
│|===============================+======================+======================|
│| 0 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
│| N/A 33C P0 38W / 150W | 3910MiB / 7618MiB | 23% Default |
│+-------------------------------+----------------------+----------------------+
│| 1 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
│| N/A 38C P0 38W / 150W | 327MiB / 7618MiB | 0% Default |
│+-------------------------------+----------------------+----------------------+
│
│+-----------------------------------------------------------------------------+
│| Processes: GPU Memory |
│| GPU PID Type Process name Usage |
│|=============================================================================|
│| 0 19374 C /usr/bin/python3 3899MiB |
│| 1 19374 C /usr/bin/python3 315MiB |
│+-----------------------------------------------------------------------------+
```
It's interesting to note the varying GPU allocation at that point.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services