Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/02/01 19:41:24 UTC
[GitHub] renganxu opened a new issue #14047: mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'

renganxu opened a new issue #14047: mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'
URL: https://github.com/apache/incubator-mxnet/issues/14047
 
 
   ## Description
   I got the error "**mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'**" when I ran the ResNet-50 model from Image Classification in the MLPerf benchmark.
   
   ## Environment info (Required)
   ```
   ----------Python Info----------
   Version      : 3.6.5
   Compiler     : GCC 7.2.0
   Build        : ('default', 'Apr 29 2018 16:14:56')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 19.0.1
   Directory    : /home/rengan/miniconda3/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.5.0
   Directory    : /home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Linux-3.10.0-862.el7.x86_64-x86_64-with-redhat-7.5-Maipo
   system       : Linux
   node         : node001
   release      : 3.10.0-862.el7.x86_64
   version      : #1 SMP Wed Mar 21 18:14:51 EDT 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                40
   On-line CPU(s) list:   0-39
   Thread(s) per core:    1
   Core(s) per socket:    20
   Socket(s):             2
   NUMA node(s):          2
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 85
   Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
   Stepping:              4
   CPU MHz:               2400.000
   BogoMIPS:              4800.00
   Virtualization:        VT-x
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              1024K
   L3 cache:              28160K
   NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
   NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0371 sec, LOAD: 0.5303 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0976 sec, LOAD: 0.4910 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1072 sec, LOAD: 0.4491 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0461 sec, LOAD: 0.2345 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0694 sec, LOAD: 2.2932 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0473 sec, LOAD: 0.4014 sec.
   ```
   Package used (Python/R/Scala/Julia): Python
   
   ## Build info (Required if built from source)
   Compiler (gcc/clang/mingw/visual studio): gcc 7.2.0
   MXNet commit hash: f95e7949dcd96ca2a5a140dbcff16dd344b45d19
   
   Build config:
   ```
   make -j 40 USE_OPENCV=1 \
             USE_BLAS=openblas USE_CUDA=1 \
             USE_CUDA_PATH=/cm/shared/apps/cuda10.0/toolkit/10.0.130 \
             USE_CUDNN=1 \
             USE_NCCL=1 \
             USE_NCCL_PATH=/home/rengan/apps/nccl/2.3.7
   ```
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "train_imagenet.py", line 122, in <module>
       sym = net.get_symbol(**vars(args))
     File "/home/rengan/DL_benchmark/MLPerf/mlperf_results/v0.5.0/dellemc/submission/code/image_classification/mxnet/symbols/resnet-v1b-fl.py", line 370, in get_symbol
       use_dali          = use_dali)
     File "/home/rengan/DL_benchmark/MLPerf/mlperf_results/v0.5.0/dellemc/submission/code/image_classification/mxnet/symbols/resnet-v1b-fl.py", line 250, in resnet
       cudnn_tensor_core_only=force_tensor_core)
     File "<string>", line 179, in Convolution
     File "/home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/_ctypes/symbol.py", line 125, in _symbol_creator
       ctypes.byref(sym_handle)))
     File "/home/rengan/apps/mxnet/dev/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/base.py", line 252, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose', Possible Arguments:
   ----------------
   kernel : Shape(tuple), required
       Convolution kernel size: (w,), (h, w) or (d, h, w)
   stride : Shape(tuple), optional, default=[]
       Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
   dilate : Shape(tuple), optional, default=[]
       Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
   pad : Shape(tuple), optional, default=[]
       Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
   num_filter : int (non-negative), required
       Convolution filter(channel) number
   num_group : int (non-negative), optional, default=1
       Number of group partitions.
   workspace : long (non-negative), optional, default=1024
       Maximum temporary workspace allowed (MB) in convolution.This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when `limited_workspace` strategy is used.
   no_bias : boolean, optional, default=0
       Whether to disable bias parameter.
   cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
       Whether to pick convolution algo by running performance test.
   cudnn_off : boolean, optional, default=0
       Turn off cudnn for this layer.
   layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
       Set layout for input, output and weight. Empty for
       default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.NHWC and NDHWC are only supported on GPU.
   , in operator Convolution(name="", layout="NHWC", no_bias="True", cudnn_algo_verbose="0", cudnn_algo_fwd="1", cudnn_algo_bwd_data="1", cudnn_algo_bwd_filter="1", stride="(2, 2)", cudnn_tensor_core_only="1", pad="(3, 3)", num_filter="64", kernel="(7, 7)", workspace="256")
   
   ```
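
The message names a single argument, yet the call at its end passes several `cudnn_*` keywords. As a rough diagnostic, the sketch below compares the keyword names in the trailing `in operator Convolution(...)` part against the names listed under "Possible Arguments". The helper name `unknown_arguments` is hypothetical, the error text is an abridged copy of the one above, and treating `name` as the symbol name rather than an operator parameter is an assumption.

```python
import re

# Abridged copy of the error text from the traceback above
# (most description lines are omitted; the argument names are unchanged).
ERROR_TEXT = '''Cannot find argument 'cudnn_algo_verbose', Possible Arguments:
----------------
kernel : Shape(tuple), required
    Convolution kernel size: (w,), (h, w) or (d, h, w)
stride : Shape(tuple), optional, default=[]
dilate : Shape(tuple), optional, default=[]
pad : Shape(tuple), optional, default=[]
num_filter : int (non-negative), required
num_group : int (non-negative), optional, default=1
workspace : long (non-negative), optional, default=1024
no_bias : boolean, optional, default=0
cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
cudnn_off : boolean, optional, default=0
layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
, in operator Convolution(name="", layout="NHWC", no_bias="True", cudnn_algo_verbose="0", cudnn_algo_fwd="1", cudnn_algo_bwd_data="1", cudnn_algo_bwd_filter="1", stride="(2, 2)", cudnn_tensor_core_only="1", pad="(3, 3)", num_filter="64", kernel="(7, 7)", workspace="256")'''


def unknown_arguments(error_text):
    """List passed keyword names that are absent from 'Possible Arguments'."""
    # Accepted names: unindented lines of the form "name : type, ...".
    accepted = set(re.findall(r"^(\w+) : ", error_text, flags=re.MULTILINE))
    # Passed names: the key="value" pairs inside "in operator Op(...)".
    call = re.search(r"in operator \w+\((.*)\)", error_text, flags=re.DOTALL).group(1)
    passed = re.findall(r'(\w+)="', call)
    # Assumption: "name" is the symbol name, not an operator parameter.
    return [arg for arg in passed if arg not in accepted and arg != "name"]


print(unknown_arguments(ERROR_TEXT))
```

For the message above this lists five names: `cudnn_algo_verbose`, `cudnn_algo_fwd`, `cudnn_algo_bwd_data`, `cudnn_algo_bwd_filter` and `cudnn_tensor_core_only`.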
   
   ## Minimum reproducible example
   The MXNet implementation of the ResNet-50 image classification model in MLPerf:
   https://github.com/mlperf/results/tree/master/v0.5.0/nvidia/submission/code/image_classification/mxnet
   
   ## Steps to reproduce
   First, install the nvidia-dali dependency:
   ```
   $ pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali
   ```
   Then run the benchmark:
   ```
   $ python3.6 train_imagenet.py \
   --gpus 0,1,2,3 \
   --batch-size 1664 \
   --kv-store device \
   --lr 0.6 \
   --lr-step-epochs 30,60,80 \
   --warmup-epochs 5 \
   --eval-period 4 \
   --eval-offset 2 \
   --optimizer sgd \
   --network resnet-v1b-fl \
   --num-layers 50 \
   --num-epochs 100 \
   --accuracy-threshold 0.749 \
   --dtype float16 \
   --use-dali \
   --disp-batches 20 \
   --image-shape 4,224,224 \
   --fuse-bn-relu 1 \
   --fuse-bn-add-relu 1 \
   --min-random-area 0.05 \
   --max-random-area 1.0 \
   --conv-algo 1 \
   --force-tensor-core 1 \
   --input-layout NHWC \
   --conv-layout NHWC \
   --batchnorm-layout NHWC \
   --pooling-layout NHWC \
   --batchnorm-mom 0.9 \
   --batchnorm-eps 1e-5 \
   --data-train /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/train.rec \
   --data-train-idx /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/train.idx \
   --data-val /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/val.rec \
   --data-val-idx /mnt/isilon/DeepLearning/database/mlperf/ilsvrc2012_mxnet/val.idx \
   --dali-prefetch-queue 2 \
   --dali-nvjpeg-memory-padding 64
   ```
   
   ## What have you tried to solve it?
   
   1. This error first appeared when I installed mxnet-cu100mkl using pip. I then built MXNet from source using the latest development code (all the output shown above comes from that build), but the same error still occurs.
   2. Since this error is related to cuDNN, I checked whether the MXNet library is linked against libcudnn, and it is:
   ```
   [root@node001 ~]# ldd /home/rengan/apps/mxnet/dev/lib/libmxnet.so |grep cudnn
           libcudnn.so.7 => /home/rengan/apps/cudnn/7.4.1/lib64/libcudnn.so.7 (0x00002aaade829000)
   ```
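
A complementary check is whether the installed build documents the argument at all, which is a separate question from whether libcudnn is linked. This is a minimal sketch, assuming the operator's docstring lists its parameters as `name : type` lines, like the "Possible Arguments" list in the error message. The helper `documents_argument` and the stand-in `fake_convolution` are hypothetical, so the example runs without MXNet installed.

```python
def documents_argument(op, arg_name):
    """Return True if arg_name appears as a 'name : type' entry in op's docstring."""
    doc = getattr(op, "__doc__", None) or ""
    return any(line.strip().startswith(arg_name + " :") for line in doc.splitlines())


# Hypothetical stand-in for an operator function; its docstring copies two
# entries from the "Possible Arguments" list in the error message.
def fake_convolution(**kwargs):
    """Parameters
    ----------
    kernel : Shape(tuple), required
    cudnn_off : boolean, optional, default=0
    """


print(documents_argument(fake_convolution, "cudnn_off"))
print(documents_argument(fake_convolution, "cudnn_algo_verbose"))
```

With MXNet installed, the same helper could be pointed at `mxnet.sym.Convolution` instead of the stand-in, provided its docstring follows the assumed layout.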
   
