Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/02/01 19:41:24 UTC
[GitHub] renganxu opened a new issue #14047: mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'
renganxu opened a new issue #14047: mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'
URL: https://github.com/apache/incubator-mxnet/issues/14047
## Description
The error "**mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'**" occurred when I ran the ResNet-50 model from the Image Classification benchmark in MLPerf.
## Environment info (Required)
```
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.0.1
Directory : /home/rengan/miniconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.5.0
Directory : /home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-3.10.0-862.el7.x86_64-x86_64-with-redhat-7.5-Maipo
system : Linux
node : node001
release : 3.10.0-862.el7.x86_64
version : #1 SMP Wed Mar 21 18:14:51 EDT 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0371 sec, LOAD: 0.5303 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0976 sec, LOAD: 0.4910 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1072 sec, LOAD: 0.4491 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0461 sec, LOAD: 0.2345 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0694 sec, LOAD: 2.2932 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0473 sec, LOAD: 0.4014 sec.
```
Package used (Python/R/Scala/Julia): Python
## Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): gcc 7.2.0
MXNet commit hash: f95e7949dcd96ca2a5a140dbcff16dd344b45d19
Build config:
```
make -j 40 USE_OPENCV=1 \
USE_BLAS=openblas USE_CUDA=1 \
USE_CUDA_PATH=/cm/shared/apps/cuda10.0/toolkit/10.0.130 \
USE_CUDNN=1 \
USE_NCCL=1 \
USE_NCCL_PATH=/home/rengan/apps/nccl/2.3.7
```
## Error Message:
```
Traceback (most recent call last):
File "train_imagenet.py", line 122, in <module>
sym = net.get_symbol(**vars(args))
File "/home/rengan/DL_benchmark/MLPerf/mlperf_results/v0.5.0/dellemc/submission/code/image_classification/mxnet/symbols/resnet-v1b-fl.py", line 370, in get_symbol
use_dali = use_dali)
File "/home/rengan/DL_benchmark/MLPerf/mlperf_results/v0.5.0/dellemc/submission/code/image_classification/mxnet/symbols/resnet-v1b-fl.py", line 250, in resnet
cudnn_tensor_core_only=force_tensor_core)
File "<string>", line 179, in Convolution
File "/home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/_ctypes/symbol.py", line 125, in _symbol_creator
ctypes.byref(sym_handle)))
File "/home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose', Possible Arguments:
----------------
kernel : Shape(tuple), required
Convolution kernel size: (w,), (h, w) or (d, h, w)
stride : Shape(tuple), optional, default=[]
Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
dilate : Shape(tuple), optional, default=[]
Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
pad : Shape(tuple), optional, default=[]
Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
num_filter : int (non-negative), required
Convolution filter(channel) number
num_group : int (non-negative), optional, default=1
Number of group partitions.
workspace : long (non-negative), optional, default=1024
Maximum temporary workspace allowed (MB) in convolution.This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when `limited_workspace` strategy is used.
no_bias : boolean, optional, default=0
Whether to disable bias parameter.
cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
Whether to pick convolution algo by running performance test.
cudnn_off : boolean, optional, default=0
Turn off cudnn for this layer.
layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
Set layout for input, output and weight. Empty for
default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.NHWC and NDHWC are only supported on GPU.
, in operator Convolution(name="", layout="NHWC", no_bias="True", cudnn_algo_verbose="0", cudnn_algo_fwd="1", cudnn_algo_bwd_data="1", cudnn_algo_bwd_filter="1", stride="(2, 2)", cudnn_tensor_core_only="1", pad="(3, 3)", num_filter="64", kernel="(7, 7)", workspace="256")
```
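The message above names only the first unrecognized argument. As a rough sketch (not part of the original report), the snippet below parses the message text with the standard library and lists every passed argument that is missing from the "Possible Arguments" list. The function names and the abridged `ERROR_TEXT` are illustrative, and `name` is skipped on the assumption that it is the symbol name rather than an operator parameter.

```python
import re

# Abridged copy of the error message above (most description lines omitted).
ERROR_TEXT = """\
mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose', Possible Arguments:
----------------
kernel : Shape(tuple), required
    Convolution kernel size: (w,), (h, w) or (d, h, w)
stride : Shape(tuple), optional, default=[]
dilate : Shape(tuple), optional, default=[]
pad : Shape(tuple), optional, default=[]
num_filter : int (non-negative), required
num_group : int (non-negative), optional, default=1
workspace : long (non-negative), optional, default=1024
no_bias : boolean, optional, default=0
cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
cudnn_off : boolean, optional, default=0
layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
, in operator Convolution(name="", layout="NHWC", no_bias="True", cudnn_algo_verbose="0", cudnn_algo_fwd="1", cudnn_algo_bwd_data="1", cudnn_algo_bwd_filter="1", stride="(2, 2)", cudnn_tensor_core_only="1", pad="(3, 3)", num_filter="64", kernel="(7, 7)", workspace="256")
"""


def possible_arguments(error_text):
    """Return the argument names listed after 'Possible Arguments:'."""
    names = []
    in_list = False
    for line in error_text.splitlines():
        if "Possible Arguments" in line:
            in_list = True
            continue
        if not in_list:
            continue
        # Entries look like "kernel : Shape(tuple), required";
        # description lines do not start with "<word> : ".
        match = re.match(r"^\s*(\w+) : ", line)
        if match:
            names.append(match.group(1))
    return names


def passed_arguments(error_text):
    """Return the keyword names from the trailing 'in operator Op(...)' part."""
    match = re.search(r"in operator \w+\((.*)\)\s*$", error_text, re.S)
    if not match:
        return []
    return re.findall(r'(\w+)="', match.group(1))


def unsupported_arguments(error_text, ignore=("name",)):
    """Passed arguments that the operator does not list as possible."""
    known = set(possible_arguments(error_text))
    return [
        arg for arg in passed_arguments(error_text)
        if arg not in known and arg not in ignore
    ]


print(unsupported_arguments(ERROR_TEXT))
```

For the message in this report, the sketch flags `cudnn_algo_verbose`, `cudnn_algo_fwd`, `cudnn_algo_bwd_data`, `cudnn_algo_bwd_filter`, and `cudnn_tensor_core_only`, so the Convolution operator in this build does not list any of the five.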
## Minimum reproducible example
The mxnet implementation of image classification model Resnet50 in MLPerf:
https://github.com/mlperf/results/tree/master/v0.5.0/nvidia/submission/code/image_classification/mxnet
## Steps to reproduce
First, install the dependency nvidia-dali:
```
$ pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali
```
Then run the benchmark:
```
$ python3.6 train_imagenet.py \
--gpus 0,1,2,3 \
--batch-size 1664 \
--kv-store device \
--lr 0.6 \
--lr-step-epochs 30,60,80 \
--warmup-epochs 5 \
--eval-period 4 \
--eval-offset 2 \
--optimizer sgd \
--network resnet-v1b-fl \
--num-layers 50 \
--num-epochs 100 \
--accuracy-threshold 0.749 \
--dtype float16 \
--use-dali \
--disp-batches 20 \
--image-shape 4,224,224 \
--fuse-bn-relu 1 \
--fuse-bn-add-relu 1 \
--min-random-area 0.05 \
--max-random-area 1.0 \
--conv-algo 1 \
--force-tensor-core 1 \
--input-layout NHWC \
--conv-layout NHWC \
--batchnorm-layout NHWC \
--pooling-layout NHWC \
--batchnorm-mom 0.9 \
--batchnorm-eps 1e-5 \
--data-train /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/train.rec \
--data-train-idx /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/train.idx \
--data-val /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/val.rec \
--data-val-idx /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/val.idx \
--dali-prefetch-queue 2 \
--dali-nvjpeg-memory-padding 64
```
## What have you tried to solve it?
1. This error first appeared when I installed mxnet-cu100mkl using pip. I then built MXNet from source using the latest development code, which produced all the output shown above, but the same error still occurs.
2. Since this error is related to cuDNN, I checked whether the MXNet library is linked against libcudnn, and it is:
```
[root@node001 ~]# ldd /home/rengan/apps/mxnet/dev/lib/libmxnet.so |grep cudnn
libcudnn.so.7 => /home/rengan/apps/cudnn/7.4.1/lib64/libcudnn.so.7 (0x00002aaade829000)
```