Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/19 02:50:10 UTC
[GitHub] feevos opened a new issue #11331: gluon bug: AttributeError:
'_thread._local' object has no attribute 'value'
URL: https://github.com/apache/incubator-mxnet/issues/11331
## Description
Dear all,
I am trying to run mxnet in a distributed HPC environment for embarrassingly parallel (distributed) runs.
The goal is Bayesian hyperparameter optimization, so the communication between nodes is not mxnet/GPU specific (only lists of hyperparameters such as learning rate and batch size). For the distributed part I chose [ray](http://ray.readthedocs.io/en/latest/). Each node has 4 GPUs and performs a run that is completely independent of the other nodes. However, I cannot even define a simple gluon layer within a ```ray.remote``` function.
With 2 (or more) nodes, this trivial example works:
```Python
import os
import sys
import ray
import time
# mxnet gpu examples
import mxnet as mx
from mxnet import nd
import numpy as np


@ray.remote(num_gpus=4)
def f():
    # In case of multiple GPUs, comment out 2nd option.
    gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
    # One mxnet context per GPU that ray made visible to this task
    tctx = [mx.gpu(i) for i in range(len(gpus))]
    a = nd.random.uniform(shape=[3, 4, 16, 16], ctx=tctx[0])
    return a.asnumpy()


if __name__ == '__main__':
    # sys.argv[1] is the redis address of the running ray cluster
    ray.init(redis_address=sys.argv[1])
    result1 = ray.get(f.remote())
    result2 = ray.get(f.remote())
    print(result1, result2)
```
However, when I try to use any gluon object that derives from HybridBlock, for example:
```Python
import ray
from mxnet import gluon


@ray.remote(num_gpus=4)
def f(x):
    loss = gluon.loss.L2Loss()
    return x
```
I get an error. I have also tested ray with a simple PyTorch network, and everything works, so this is most probably an mxnet/gluon problem.
## Environment info (Required)
All nodes are identical. I ran diagnose.py on an interactive node with 4 GPUs allocated:
```Python
----------Python Info----------
Version : 3.6.4
Compiler : GCC 7.2.0
Build : ('default', 'Jan 16 2018 18:10:19')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 9.0.1
Directory : /home/dia021/Software/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/home/dia021/Software/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Version : 1.3.0
Directory : /home/dia021/Software/mxnet
Commit Hash : 0910450110c37da9f052f3b29c40c6d051f46a6a
----------System Info----------
Platform : Linux-4.4.114-94.11-default-x86_64-with-SuSE-12-x86_64
system : Linux
node : b050
release : 4.4.114-94.11-default
version : #1 SMP Thu Feb 1 19:28:26 UTC 2018 (4309ff9)
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-27
Off-line CPU(s) list: 28-55
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2599.787
BogoMIPS: 5199.57
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm intel_pt spec_ctrl retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx xsaveopt cqm_llc cqm_occup_llc
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0035 sec, LOAD: 1.1334 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0084 sec, LOAD: 0.9156 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2081 sec, LOAD: 0.0405 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2181 sec, LOAD: 0.2282 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0068 sec, LOAD: 0.5931 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0075 sec, LOAD: 0.0752 sec.
```
nvidia-smi
```Python
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000000:04:00.0 Off | 0 |
| N/A 31C P0 32W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 29C P0 32W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 30C P0 31W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... Off | 00000000:08:00.0 Off | 0 |
| N/A 32C P0 29W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
```
## Error Message:
```Python
Traceback (most recent call last):
File "test_ray.py", line 75, in <module>
x1 = ray.get(feature1_id)
File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(250d79352e7800faddddf2c11ec6fd6ea65c20b8). It was created by remote function __main__.f which failed with:
Remote function __main__.f failed with:
Traceback (most recent call last):
File "test_ray.py", line 30, in f
loss = gluon.loss.L2Loss()
File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
super(Loss, self).__init__(**kwargs)
File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
super(HybridBlock, self).__init__(prefix=prefix, params=params)
File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'
Remote function __main__.f failed with:
Traceback (most recent call last):
File "test_ray.py", line 30, in f
loss = gluon.loss.L2Loss()
File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
super(Loss, self).__init__(**kwargs)
File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
super(HybridBlock, self).__init__(prefix=prefix, params=params)
File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'
You can inspect errors by running
ray.error_info()
If this driver is hanging, start a new one with
ray.init(redis_address="10.141.1.77:6379")
```
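The last frame of the traceback reads `NameManager._current.value` on a `_thread._local` object. One possible reading (an assumption, not checked against the mxnet source) is that `value` is assigned only in the thread that imports mxnet, while the ray worker runs `f` in a different thread. The stdlib-only sketch below reproduces the same message without mxnet or ray; `_current`, `read_unguarded`, `read_guarded` and `run_in_thread` are hypothetical names used only for illustration.

```python
import threading

# Hypothetical stand-in for a module-level thread-local such as the one
# named in the traceback; this is NOT mxnet code.
_current = threading.local()
_current.value = "default-manager"  # assigned only in the importing thread


def read_unguarded():
    # Fails in any thread that never assigned _current.value.
    return _current.value


def read_guarded():
    # Lazily create the per-thread value if this thread has none yet.
    if not hasattr(_current, "value"):
        _current.value = "default-manager"
    return _current.value


def run_in_thread(fn):
    """Run fn in a fresh thread and return its result or its error text."""
    out = []

    def target():
        try:
            out.append(fn())
        except AttributeError as exc:
            out.append("AttributeError: " + str(exc))

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return out[0]


unguarded_result = run_in_thread(read_unguarded)
guarded_result = run_in_thread(read_guarded)
print(unguarded_result)
print(guarded_result)
```

If this reading is right, the unguarded read fails in the new thread while the guarded one succeeds, which would suggest that a per-thread initialization is what is missing. Whether that applies to mxnet itself is not established here.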
## Minimum reproducible example
This is a Python file. It needs to be executed after the ray cluster has been started, with (in a SLURM environment) srun python name_of_file.py
```Python
import sys
# Distributed stuff
import ray
# mxnet
from mxnet import gluon


# A trivial function to reproduce the example
@ray.remote(num_gpus=4)
def f(x):
    # Constructing any HybridBlock-derived object here triggers the error
    loss = gluon.loss.L2Loss()
    return x


if __name__ == '__main__':
    # here sys.argv[1] is the redis_address after the initiation of the ray cluster
    ray.init(redis_address=sys.argv[1])
    feature1_id = f.remote(0)
    x1 = ray.get(feature1_id)
    print(x1)
```
Any workaround or advice would be most appreciated. This is also linked to this [gluon-cv issue](https://github.com/dmlc/gluon-cv/issues/156)
Thank you very much
Foivos