Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/15 12:23:48 UTC

[GitHub] elabeca opened a new issue #9087: MXNET 1.0.0 - marginal performance improvement Titan V (Volta) with half precision cuda 9.0 + cudnn 7.0.5

URL: https://github.com/apache/incubator-mxnet/issues/9087
 
 
   
   ## Description
    Running example/image-classification/train_cifar10.py on MXNet v1.0.0 shows only a marginal performance difference on a Titan V (Volta) card when half precision (--dtype float16) is enabled, with CUDA 9.0 / cuDNN 7.0.5. In the runs below, float16 is in fact slightly slower than float32 on the Titan V.
   
   ## Environment info (Required)
   
   ----------Python Info----------
   ('Version      :', '2.7.12')
   ('Compiler     :', 'GCC 5.4.0 20160609')
   ('Build        :', ('default', 'Nov 20 2017 18:23:56'))
   ('Arch         :', ('64bit', 'ELF'))
   ------------Pip Info-----------
   ('Version      :', '9.0.1')
   ('Directory    :', '/home/elie/.local/lib/python2.7/site-packages/pip')
   ----------MXNet Info-----------
   ('Version      :', '1.0.0')
   ('Directory    :', '/home/elie/mxnet/python/mxnet')
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   ('Platform     :', 'Linux-4.10.0-42-generic-x86_64-with-Ubuntu-16.04-xenial')
   ('system       :', 'Linux')
   ('node         :', 'zeus')
   ('release      :', '4.10.0-42-generic')
   ('version      :', '#46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017')
   ----------Hardware Info----------
   ('machine      :', 'x86_64')
   ('processor    :', 'x86_64')
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                12
   On-line CPU(s) list:   0-11
   Thread(s) per core:    2
   Core(s) per socket:    6
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 63
   Model name:            Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
   Stepping:              2
   CPU MHz:               1397.308
   CPU max MHz:           4100.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              6999.98
   Virtualisation:        VT-x
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              15360K
   NUMA node0 CPU(s):     0-11
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0608 sec, LOAD: 0.9193 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0531 sec, LOAD: 0.3470 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0671 sec, LOAD: 0.4028 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0546 sec, LOAD: 0.2149 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.3649 sec, LOAD: 0.3953 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.3955 sec, LOAD: 0.8753 sec.
   
   Package used (Python/R/Scala/Julia):
   Python 2.7
   
   ## Build info (Required if built from source)
   
   Compiler (gcc/clang/mingw/visual studio):
   gcc
   
   MXNet commit hash:
   25720d0e3c29232a37e2650f3ba3a2454f9367bb
   
   Build config:
   #-------------------------------------------------------------------------------
   #  Template configuration for compiling mxnet
   #
   #  If you want to change the configuration, please use the following
   #  steps. Assume you are on the root directory of mxnet. First copy the this
   #  file so that any local changes will be ignored by git
   #
   #  $ cp make/config.mk .
   #
   #  Next modify the according entries, and then compile by
   #
   #  $ make
   #
   #  or build in parallel with 8 threads
   #
   #  $ make -j8
   #-------------------------------------------------------------------------------
   
   #---------------------
   # choice of compiler
   #--------------------
   
   export CC = gcc
   export CXX = g++
   export NVCC = nvcc
   
   # whether compile with options for MXNet developer
   DEV = 0
   
   # whether compile with debug
   DEBUG = 0
   
   # whether compile with profiler
   USE_PROFILER =
   
   # whether to turn on signal handler (e.g. segfault logger)
   USE_SIGNAL_HANDLER =
   
   # the additional link flags you want to add
   ADD_LDFLAGS =
   
   # the additional compile flags you want to add
   ADD_CFLAGS =
   
   #---------------------------------------------
   # matrix computation libraries for CPU/GPU
   #---------------------------------------------
   
   # whether use CUDA during compile
   USE_CUDA = 1
   
   # add the path to CUDA library to link and compile flag
   # if you have already add them to environment variable, leave it as NONE
   # USE_CUDA_PATH = /usr/local/cuda
   USE_CUDA_PATH = /usr/local/cuda
   
   # whether use CuDNN R3 library
   USE_CUDNN = 1
   
   #whether to use NCCL library
   USE_NCCL = 0
   #add the path to NCCL library
   USE_NCCL_PATH = NONE
   
   # whether use opencv during compilation
   # you can disable it, however, you will not able to use
   # imbin iterator
   USE_OPENCV = 1
   
   #whether use libjpeg-turbo for image decode without OpenCV wrapper
   USE_LIBJPEG_TURBO = 0
   #add the path to libjpeg-turbo library
   USE_LIBJPEG_TURBO_PATH = NONE
   
   # use openmp for parallelization
   USE_OPENMP = 1
   
   # MKL ML Library for Intel CPU/Xeon Phi
   # Please refer to MKL_README.md for details
   
   # MKL ML Library folder, need to be root for /usr/local
   # Change to User Home directory for standard user
   # For USE_BLAS!=mkl only
   MKLML_ROOT=/usr/local
   
   # whether use MKL2017 library
   USE_MKL2017 = 0
   
   # whether use MKL2017 experimental feature for high performance
   # Prerequisite USE_MKL2017=1
   USE_MKL2017_EXPERIMENTAL = 0
   
   # whether use NNPACK library
   USE_NNPACK = 0
   
   # choose the version of blas you want to use
   # can be: mkl, blas, atlas, openblas
   # in default use atlas for linux while apple for osx
   UNAME_S := $(shell uname -s)
   ifeq ($(UNAME_S), Darwin)
   USE_BLAS = apple
   else
   USE_BLAS = atlas
   endif
   
   # whether use lapack during compilation
   # only effective when compiled with blas versions openblas/apple/atlas/mkl
   USE_LAPACK = 1
   
   # path to lapack library in case of a non-standard installation
   USE_LAPACK_PATH =
   
   # by default, disable lapack when using MKL
   # switch on when there is a full installation of MKL available (not just MKL2017/MKL_ML)
   ifeq ($(USE_BLAS), mkl)
   USE_LAPACK = 0
   endif
   
   # add path to intel library, you may need it for MKL, if you did not add the path
   # to environment variable
   USE_INTEL_PATH = NONE
   
   # If use MKL only for BLAS, choose static link automatically to allow python wrapper
   ifeq ($(USE_MKL2017), 0)
   ifeq ($(USE_BLAS), mkl)
   USE_STATIC_MKL = 1
   endif
   else
   USE_STATIC_MKL = NONE
   endif
   
   #----------------------------
   # Settings for power and arm arch
   #----------------------------
   ARCH := $(shell uname -a)
   ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
   	USE_SSE=0
   else
   	USE_SSE=1
   endif
   
   #----------------------------
   # distributed computing
   #----------------------------
   
   # whether or not to enable multi-machine supporting
   USE_DIST_KVSTORE = 0
   
   # whether or not allow to read and write HDFS directly. If yes, then hadoop is
   # required
   USE_HDFS = 0
   
   # path to libjvm.so. required if USE_HDFS=1
   LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server
   
   # whether or not allow to read and write AWS S3 directly. If yes, then
   # libcurl4-openssl-dev is required, it can be installed on Ubuntu by
   # sudo apt-get install -y libcurl4-openssl-dev
   USE_S3 = 0
   
   #----------------------------
   # performance settings
   #----------------------------
   # Use operator tuning
   USE_OPERATOR_TUNING = 1
   
   # Use gperftools if found
   USE_GPERFTOOLS = 1
   
   # Use JEMalloc if found, and not using gperftools
   USE_JEMALLOC = 1
   
   #----------------------------
   # additional operators
   #----------------------------
   
   # path to folders containing projects specific operators that you don't want to put in src/operators
   EXTRA_OPERATORS =
   
   #----------------------------
   # other features
   #----------------------------
   
   # Create C++ interface package
   USE_CPP_PACKAGE = 0
   
   #----------------------------
   # plugins
   #----------------------------
   
   # whether to use caffe integration. This requires installing caffe.
   # You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
   # CAFFE_PATH = $(HOME)/caffe
   # MXNET_PLUGINS += plugin/caffe/caffe.mk
   
   # whether to use torch integration. This requires installing torch.
   # You also need to add TORCH_PATH/install/lib to your LD_LIBRARY_PATH
   # TORCH_PATH = $(HOME)/torch
   # MXNET_PLUGINS += plugin/torch/torch.mk
   
   # WARPCTC_PATH = $(HOME)/warp-ctc
   # MXNET_PLUGINS += plugin/warpctc/warpctc.mk
   
   # whether to use sframe integration. This requires build sframe
   # git@github.com:dato-code/SFrame.git
   # SFRAME_PATH = $(HOME)/SFrame
   # MXNET_PLUGINS += plugin/sframe/plugin.mk
   
   ## Error Message:
   None
   
   ## Minimum reproducible example
   time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
   
   vs
   
   time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
   
   ## Steps to reproduce
   
   time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
   
   vs
   
   time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
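The two commands above can also be timed back-to-back from a small wrapper instead of the shell's time builtin, so both runs share the same environment. This is a hypothetical convenience script, not part of the example: run_timed and the label loop are illustrative, and it assumes train_cifar10.py is in the current directory.

```python
import os
import subprocess
import sys
import time


def run_timed(argv):
    """Run a command and return (exit code, wall-clock seconds)."""
    start = time.time()
    rc = subprocess.call(argv)
    return rc, time.time() - start


if __name__ == "__main__" and os.path.exists("train_cifar10.py"):
    base = [sys.executable, "train_cifar10.py", "--network", "resnet",
            "--num-epochs", "1", "--num-layers", "110",
            "--batch-size", "512", "--gpus", "0"]
    for label, extra in (("float32", []), ("float16", ["--dtype", "float16"])):
        rc, secs = run_timed(base + extra)
        print("%s: exit=%d wall=%.1fs" % (label, rc, secs))
```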
   
   ## What have you tried to solve it?
   
    Compared results between a Titan V (Volta) card and a Titan X (Pascal) card. Tried with and without half precision set for the train_cifar10.py example on resnet, with one epoch, 110 layers, and a batch size of 512.
   
   Results for Volta (Titan V) with dtype float16 flag set:
   
   INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
   [11:33:43] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
   [11:33:47] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
   [11:33:48] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]	Speed: 2734.21 samples/sec	accuracy=0.142020
   INFO:root:Epoch[0] Batch [40]	Speed: 2709.23 samples/sec	accuracy=0.202832
   INFO:root:Epoch[0] Batch [60]	Speed: 2724.39 samples/sec	accuracy=0.233984
   INFO:root:Epoch[0] Batch [80]	Speed: 2751.87 samples/sec	accuracy=0.268652
   INFO:root:Epoch[0] Train-accuracy=0.303653
   INFO:root:Epoch[0] Time cost=18.777
   INFO:root:Epoch[0] Validation-accuracy=0.314453
   
   real	0m26.451s
   user	0m36.516s
   sys	0m9.708s
   
   
   Results for Volta (Titan V) without half-precision flag set:
   
   time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
   INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
   [11:30:53] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
   [11:30:56] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
   [11:30:58] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]	Speed: 2855.89 samples/sec	accuracy=0.121931
   INFO:root:Epoch[0] Batch [40]	Speed: 2933.23 samples/sec	accuracy=0.191406
   INFO:root:Epoch[0] Batch [60]	Speed: 2944.27 samples/sec	accuracy=0.239551
   INFO:root:Epoch[0] Batch [80]	Speed: 2871.48 samples/sec	accuracy=0.271289
   INFO:root:Epoch[0] Train-accuracy=0.301356
   INFO:root:Epoch[0] Time cost=17.768
   INFO:root:Epoch[0] Validation-accuracy=0.340820
   
   real	0m25.560s
   user	0m34.052s
   sys	0m9.416s
   
   Results for Pascal (Titan X) with dtype float16 flag set:
   
   time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
   INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
   [11:33:43] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
   [11:33:47] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
   [11:33:48] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]	Speed: 2734.21 samples/sec	accuracy=0.142020
   INFO:root:Epoch[0] Batch [40]	Speed: 2709.23 samples/sec	accuracy=0.202832
   INFO:root:Epoch[0] Batch [60]	Speed: 2724.39 samples/sec	accuracy=0.233984
   INFO:root:Epoch[0] Batch [80]	Speed: 2751.87 samples/sec	accuracy=0.268652
   INFO:root:Epoch[0] Train-accuracy=0.303653
   INFO:root:Epoch[0] Time cost=18.777
   INFO:root:Epoch[0] Validation-accuracy=0.314453
   
   real	0m26.451s
   user	0m36.516s
   sys	0m9.708s
   
   Results for Pascal (Titan X) without half-precision flag set:
   
   time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 2
   INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='2', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
   [11:32:37] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
   [11:32:41] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
   [11:32:42] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]	Speed: 2228.10 samples/sec	accuracy=0.141927
   INFO:root:Epoch[0] Batch [40]	Speed: 2234.42 samples/sec	accuracy=0.199609
   INFO:root:Epoch[0] Batch [60]	Speed: 2258.77 samples/sec	accuracy=0.235449
   INFO:root:Epoch[0] Batch [80]	Speed: 2237.78 samples/sec	accuracy=0.266992
   INFO:root:Epoch[0] Train-accuracy=0.286880
   INFO:root:Epoch[0] Time cost=22.809
   INFO:root:Epoch[0] Validation-accuracy=0.343164
   
   real	0m31.823s
   user	0m41.688s
   sys	0m11.076s
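Averaging the per-batch throughput figures makes the comparison explicit. The numbers below are copied from the four "Speed" lines of each run above; the float16 average on the Titan V comes out slightly below its float32 average:

```python
# Per-batch throughput (samples/sec) copied from the epoch logs above.
volta_fp16 = [2734.21, 2709.23, 2724.39, 2751.87]
volta_fp32 = [2855.89, 2933.23, 2944.27, 2871.48]
pascal_fp32 = [2228.10, 2234.42, 2258.77, 2237.78]


def mean(xs):
    return sum(xs) / float(len(xs))


# float16 vs float32 on the same Titan V card (below 1.0 means a slowdown).
volta_fp16_speedup = mean(volta_fp16) / mean(volta_fp32)
# float32 on the Titan V vs float32 on the Titan X.
volta_vs_pascal = mean(volta_fp32) / mean(pascal_fp32)

print("Volta float16 vs float32: %.2fx" % volta_fp16_speedup)
print("Volta float32 vs Pascal float32: %.2fx" % volta_vs_pascal)
```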
   
   
   
