Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/19 02:50:10 UTC

[GitHub] feevos opened a new issue #11331: gluon bug: AttributeError: '_thread._local' object has no attribute 'value'

URL: https://github.com/apache/incubator-mxnet/issues/11331
 
 
   ## Description
   Dear all, 
   
   I am trying to run mxnet in a distributed HPC environment for embarrassingly parallel (distributed) runs.
   The goal is to use this for Bayesian hyperparameter optimization, so the communication between nodes involves nothing mxnet/GPU specific (just lists of hyperparameters: learning rate, batch size, etc.). For my distributed needs I chose [ray](http://ray.readthedocs.io/en/latest/). Each node has 4 GPUs and performs a run completely independent of the other nodes. However, I cannot even define a simple gluon layer inside
   a ```ray.remote``` function.
   
   When I use 2 (or more) nodes with this trivial example, everything works:
   
   ```Python
   import os
   import sys
   import ray
   import time
   
   # mxnet gpu examples 
   import mxnet as mx
   from mxnet import nd
   import numpy as np
   
   
   @ray.remote(num_gpus=4)
   def f():
       gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]  # GPU ids ray assigned to this worker
       tctx = [mx.gpu(i) for i in range(len(gpus))]
       a = nd.random.uniform(shape=[3, 4, 16, 16], ctx=tctx[0])
       return a.asnumpy()
   
   if __name__ == '__main__':
       ray.init(redis_address=sys.argv[1])
       result1 = ray.get(f.remote())
       result2 = ray.get(f.remote())
   
       print(result1, result2)
   ```
   
   However, when I try to use any gluon object that derives from HybridBlock, for example: 
   
   ```Python
   @ray.remote(num_gpus=4)
   def f(x):
       loss = gluon.loss.L2Loss()
       return x
   ```
   I get an error. I've also tested ray with a simple pytorch nn (everything works), so this is most probably an mxnet/gluon problem.
   
   
   
   ## Environment info (Required)
   All nodes are identical. I ran diagnose.py on an interactive node with 4 GPUs allocated:
   
   ```
   
   ----------Python Info----------
   Version      : 3.6.4
   Compiler     : GCC 7.2.0
   Build        : ('default', 'Jan 16 2018 18:10:19')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 9.0.1
   Directory    : /home/dia021/Software/anaconda3/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   /home/dia021/Software/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   Version      : 1.3.0
   Directory    : /home/dia021/Software/mxnet
   Commit Hash   : 0910450110c37da9f052f3b29c40c6d051f46a6a
   ----------System Info----------
   Platform     : Linux-4.4.114-94.11-default-x86_64-with-SuSE-12-x86_64
   system       : Linux
   node         : b050
   release      : 4.4.114-94.11-default
   version      : #1 SMP Thu Feb 1 19:28:26 UTC 2018 (4309ff9)
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                56
   On-line CPU(s) list:   0-27
   Off-line CPU(s) list:  28-55
   Thread(s) per core:    1
   Core(s) per socket:    14
   Socket(s):             2
   NUMA node(s):          2
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
   Stepping:              1
   CPU MHz:               2599.787
   BogoMIPS:              5199.57
   Virtualization:        VT-x
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              35840K
   NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26
   NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm intel_pt spec_ctrl retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx xsaveopt cqm_llc cqm_occup_llc
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0035 sec, LOAD: 1.1334 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0084 sec, LOAD: 0.9156 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2081 sec, LOAD: 0.0405 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2181 sec, LOAD: 0.2282 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0068 sec, LOAD: 0.5931 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0075 sec, LOAD: 0.0752 sec.
   ```
   
   nvidia-smi:
   ```
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |===============================+======================+======================|
   |   0  Tesla P100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
   | N/A   31C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   1  Tesla P100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
   | N/A   29C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   2  Tesla P100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
   | N/A   30C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   3  Tesla P100-SXM2...  Off  | 00000000:08:00.0 Off |                    0 |
   | N/A   32C    P0    29W / 300W |      0MiB / 16280MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   ```
   
   
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "test_ray.py", line 75, in <module>
       x1 = ray.get(feature1_id)
     File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
       raise RayGetError(object_ids, value)
   ray.worker.RayGetError: Could not get objectid ObjectID(250d79352e7800faddddf2c11ec6fd6ea65c20b8). It was created by remote function __main__.f which failed with:
   
   Remote function __main__.f failed with:
   
   Traceback (most recent call last):
     File "test_ray.py", line 30, in f
       loss = gluon.loss.L2Loss()
     File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
       super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
     File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
       super(Loss, self).__init__(**kwargs)
     File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
       super(HybridBlock, self).__init__(prefix=prefix, params=params)
     File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
       self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
     File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
       prefix = _name.NameManager._current.value.get(None, hint) + '_'
   AttributeError: '_thread._local' object has no attribute 'value'
   
   
     You can inspect errors by running
   
         ray.error_info()
   
     If this driver is hanging, start a new one with
   
         ray.init(redis_address="10.141.1.77:6379")
   ```
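   The failing line reads `NameManager._current`, a `threading.local()` whose `.value` attribute is assigned only in the thread that imported mxnet; any other thread (such as the one a ray worker runs the function in) sees a bare `_thread._local` with no `value`. A minimal pure-Python sketch of the pattern (the `NameManager` class here is just a stand-in mirroring `mxnet.name.NameManager`; no mxnet required):
   
   ```Python
   import threading
   
   class NameManager:
       # Stand-in for mxnet.name.NameManager: _current is a threading.local()
       # whose .value gets assigned only in the thread that imports the module.
       _current = threading.local()
   
   # This assignment runs at import time, in the importing (main) thread only.
   NameManager._current.value = NameManager()
   
   results = {}
   
   def worker():
       # A different thread never ran the assignment above, so .value is missing.
       try:
           NameManager._current.value
           results["error"] = None
       except AttributeError as exc:
           results["error"] = str(exc)
   
   t = threading.Thread(target=worker)
   t.start()
   t.join()
   print(results["error"])  # '_thread._local' object has no attribute 'value'
   ```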
   
   
   ## Minimum reproducible example
   This is a Python file. It needs to be executed, after the ray cluster has been initiated, with (in a SLURM environment) ```srun python name_of_file.py```:
   ```Python
   # Distributed stuff
   import sys
   
   import ray
   
   # mxnet
   from mxnet import gluon
   
   # A trivial function to reproduce the problem
   @ray.remote(num_gpus=4)
   def f(x):
       loss = gluon.loss.L2Loss()
       return x
   
   
   if __name__ == '__main__':
       # here sys.argv[1] is the redis address obtained when the ray cluster was initiated
       ray.init(redis_address=sys.argv[1])
   
       feature1_id = f.remote(0)
       x1 = ray.get(feature1_id)
   
       print(x1)
   
   ```
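   A possible hack-around, sketched in pure Python below: (re)create the thread-local value in whichever thread needs it, before touching gluon. Translated to mxnet (an untested assumption), that would mean running something like `mx.name.NameManager._current.value = mx.name.NameManager()` at the top of the `ray.remote` function whenever the attribute is missing; the class below only mirrors the pattern.
   
   ```Python
   import threading
   
   class NameManager:
       # Stand-in for mxnet.name.NameManager; .value on _current normally
       # exists only in the thread that imported the module.
       _current = threading.local()
   
   NameManager._current.value = NameManager()  # import-time, main thread only
   
   def ensure_thread_local():
       # Re-create the per-thread value if this thread never got one.
       if not hasattr(NameManager._current, "value"):
           NameManager._current.value = NameManager()
   
   ok = []
   
   def worker():
       ensure_thread_local()  # call this at the top of the remote function
       ok.append(isinstance(NameManager._current.value, NameManager))
   
   t = threading.Thread(target=worker)
   t.start()
   t.join()
   print(ok)  # [True]
   ```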
   
   If you could please provide any workaround/advice, it would be most appreciated. This is also linked to this [gluon-cv issue](https://github.com/dmlc/gluon-cv/issues/156).
   
   Thank you very much
   Foivos 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services