Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/04/17 11:10:42 UTC

[GitHub] marcoabreu opened a new issue #10583: Slave hanging as part of KVStore test

URL: https://github.com/apache/incubator-mxnet/issues/10583
 
 
    http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10566/3/pipeline
   
    Logged into the slave and was able to reproduce the hang. I tried recompiling from scratch and using other branches, but the run always fails with an error like the following:
   
   ```
   Exception in thread Thread-3:                                                                       
   Traceback (most recent call last):                                                                   
     File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner                             
       self.run()                                                                                       
     File "/usr/lib/python2.7/threading.py", line 754, in run                                           
       self.__target(*self.__args, **self.__kwargs)                                                    
     File "/work/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/local.py", line 44, in exec_cmd
       raise RuntimeError('Get nonzero return code=%d' % ret)                                           
   RuntimeError: Get nonzero return code=-11
   ```
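    
    For reference, a negative return code from the tracker means the worker process was killed by a signal, and -11 corresponds to SIGSEGV, i.e. the worker segfaulted instead of exiting cleanly. A minimal sketch of that mapping (assuming Python 3 subprocess semantics on Linux; this is not code from dmlc-core itself):
    
    ```
    import signal
    import subprocess
    
    # A child killed by a signal reports a negative return code equal to -signum.
    proc = subprocess.Popen(["sh", "-c", "kill -SEGV $$"])
    ret = proc.wait()
    print(ret)                          # -11
    print(signal.Signals(-ret).name)    # SIGSEGV
    ```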
   
    I have verified using ```nvidia-smi``` that no other processes were allocating the GPUs before the launch. The hung process never exits; it keeps hogging the GPU at about 20% utilization and 3.9 GB of GPU memory, apparently stuck in an infinite loop.
   
   ```
   +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
    | N/A   33C    P0    38W / 150W |   3910MiB /  7618MiB |     23%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   38C    P0    38W / 150W |    327MiB /  7618MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     19374      C   /usr/bin/python3                            3899MiB |
    |    1     19374      C   /usr/bin/python3                             315MiB |
    +-----------------------------------------------------------------------------+
   ```
   
    It's interesting to note that at this point the same process (PID 19374) holds allocations of very different sizes on the two GPUs (3899 MiB vs. 315 MiB).
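    
    For anyone else debugging this, the pre-launch GPU check can also be scripted instead of eyeballing the ```nvidia-smi``` output. A minimal sketch, assuming the pynvml Python bindings are installed (this is not part of the CI scripts):
    
    ```
    import pynvml
    
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # An empty list means no compute process currently holds memory on this GPU.
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print("GPU %d: %d compute process(es)" % (i, len(procs)))
    pynvml.nvmlShutdown()
    ```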

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services