Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/10/01 18:24:28 UTC
[GitHub] fhieber opened a new issue #12710: Process Deadlock with mxnet-mkl and mkl-optimized numpy
URL: https://github.com/apache/incubator-mxnet/issues/12710
## Description
A process using mxnet-mkl hangs indefinitely when it spawns a subprocess that also uses mxnet, in an environment with MKL-optimized numpy. We recently started observing this with Sockeye; it may be related to #8532, but it can be reproduced without Sockeye (see below).
## Environment info (Required)
- Python 3.6.6
- macOS
- mxnet-mkl==1.3.0.post0
- Anaconda Numpy (with MKL optimization): `conda install mkl ; conda install numpy`
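
Since the hang depends on which builds of numpy and mxnet are installed, it can help to record the environment programmatically. The following is a minimal sketch, not part of the original report: the helper name `describe_environment` is hypothetical, and it assumes Python 3.8 or newer for `importlib.metadata` (the environment above uses 3.6.6, where this module is not available).

```python
import platform
from importlib import metadata  # assumption: Python 3.8+


def describe_environment(packages=("numpy", "mxnet", "mxnet-mkl")):
    """Return a dict with interpreter, OS, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

This only reports versions; it does not tell whether numpy is linked against MKL. For that, `numpy.show_config()` prints the build configuration of the installed numpy.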
## Minimum reproducible example
The following code reliably reproduces the deadlock/indefinite hang in the main process.
It creates a minimal module and 'trains' for 500 iterations, spawning itself in 'testing mode' every 100 iterations. The testing mode is the same mxnet code, run for fewer iterations. The main process is supposed to wait until the subprocess finishes before starting the next one.
code.py:
```python
import subprocess
import sys
import mxnet as mx

if __name__ == '__main__':

    # Any extra command-line argument switches the script into testing mode.
    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500

    # Minimal network: one fully connected layer followed by a softmax output.
    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')
    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)

    # A single fixed batch that is reused in every iteration.
    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])

    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()

    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                # Wait for the previous child before starting the next one.
                if process:
                    print("Waiting for process")
                    process.wait()
                # Re-launch this script in testing mode.
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)
    # Wait for the last child before the main process exits.
    if process:
        process.wait()
```
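
To narrow down where the main process gets stuck, the spawn-and-wait step of the script above could be wrapped in a watchdog. This is a sketch under assumptions, not part of the original report: the helper name `run_child`, the timeout values, and the 5-second margin are hypothetical. It uses only the standard library: `Popen.wait(timeout=...)` bounds the wait on the child, and `faulthandler.dump_traceback_later` prints the Python stack of every thread if the parent itself is still stuck (for example inside `Popen`) after the deadline.

```python
import faulthandler
import subprocess
import sys


def run_child(cmd, timeout_s=60.0):
    """Spawn cmd and wait up to timeout_s seconds for it to exit.

    Returns the child's exit code, or None if it did not finish in time
    (in which case the child is killed).
    """
    # If the parent is still blocked a few seconds past the deadline,
    # dump the stack of all threads to the original stderr. The margin
    # keeps the dump from firing during a normal timeout of wait().
    faulthandler.dump_traceback_later(timeout_s + 5.0, exit=False,
                                      file=sys.__stderr__)
    try:
        process = subprocess.Popen(cmd)
        try:
            return process.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()  # reap the killed child
            return None
    finally:
        # Disarm the watchdog once the call returns normally.
        faulthandler.cancel_dump_traceback_later()
```

In the reproduction script, `run_child([sys.executable, sys.argv[0], 'test'])` would replace the `Popen` and `wait` pair, at the cost of running the child synchronously rather than in parallel with training. A stack dump would then show whether the hang sits in process creation or in waiting for the child.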
## Steps to reproduce
1. conda install mkl
2. conda install numpy
3. pip install mxnet-mkl
4. python3 code.py
## What have you tried to solve it?
Replacing `mxnet-mkl` with `mxnet`, or replacing conda numpy with pip-installed numpy (`conda uninstall numpy; conda uninstall mkl; pip install numpy`), resolves the issue. The output is then as expected:
```
TRAINING
100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING
```