You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/30 17:43:38 UTC

[GitHub] Vikas89 commented on issue #14029: Out of memory error in 3d Conv for matrix splits > 10, CUDNN strange behaviour

Vikas89 commented on issue #14029: Out of memory error in 3d Conv for matrix splits > 10, CUDNN strange behaviour
URL: https://github.com/apache/incubator-mxnet/issues/14029#issuecomment-459039082
 
 
   Thanks, I was able to repro the issue.
   
   With slice of 11, looks like it is going through different cuda code.
   With 11 this is where branching occurs:
   
   ```
   _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_13slice_forwardILi5ELi1EN7mshadow3gpuEEEJPfS7_NS4_5ShapeILi5EEES9_NS_6common11StaticArrayIiLi5EEESC_EEEviDpT0_ [731]
   3.32607s  42.809ms         (131072 1 1)       (16 16 1)        56  10.160KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void fft3d_r2c_16x16x16<float, float, float2>(float2*, float*, int3, int3, int3, int3, int3, bool) [766]
   3.36884s  25.372ms          (76800 1 1)       (16 16 1)        56  10.160KB        0B         -           -           -           -    Tesla K80 (0)         1        18  void fft3d_r2c_16x16x16<float, float, float2>(float2*, float*, int3, int3, int3, int3, int3, bool) [786]
   3.39416s  53.754ms          (2048 72 1)       (256 1 1)        46  8.1250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *) [774] ...
   
   ```
   
   See details below for gpu stack trace for :10,:10 vs :11,:11
   
   Details of gpu allocation with a = net(x[:, :, :, :10, :10])
   
   
   ```
   [01:26:50] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   ==21970== NVPROF is profiling process 21970, command: python bb.py
   (300, 512, 2, 10, 10)
   ==21970== Profiling application: python bb.py
   [01:26:56] src/engine/naive_engine.cc:55: Engine shutdown
   ==21970== Profiling result:
      Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
   2.69533s  2.3040us                    -               -         -         -         -      112B  46.359MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02670s  2.0480us                    -               -         -         -         -      112B  52.154MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02685s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02702s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02718s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02776s  4.8000us                    -               -         -         -         -  2.3750KB  483.19MB/s      Device           -    Tesla K80 (0)         1        26  [CUDA memset]
   3.02778s  3.7440us                    -               -         -         -         -  2.3750KB  619.48MB/s      Device           -    Tesla K80 (0)         1        27  [CUDA memset]
   3.02780s  3.7120us                    -               -         -         -         -  2.3750KB  624.82MB/s      Device           -    Tesla K80 (0)         1        28  [CUDA memset]
   3.02782s  3.7120us                    -               -         -         -         -  2.3750KB  624.82MB/s      Device           -    Tesla K80 (0)         1        29  [CUDA memset]
   3.02861s  2.1120us                    -               -         -         -         -      112B  50.574MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.02959s  1.8464ms                    -               -         -         -         -  13.500MB  7.1400GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.03217s  131.20us          (13824 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [594]
   3.07020s  3.6699ms                    -               -         -         -         -  27.000MB  7.1848GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.07460s  257.12us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [610]
   3.07704s  71.007us                    -               -         -         -         -  512.00KB  6.8765GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.07748s  6.9440us            (512 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [626]
   3.12820s  3.7064ms                    -               -         -         -         -  27.000MB  7.1141GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.13270s  257.18us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [642]
   3.16837s  3.6899ms                    -               -         -         -         -  27.000MB  7.1458GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.17289s  257.18us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [658]
   3.17373s  81.503us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [667]
   3.17402s  85.599us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [676]
   3.17430s  84.639us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [685]
   3.17458s  84.127us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [694]
   3.17542s  2.4320us                    -               -         -         -         -        4B  1.5685MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.17543s  1.5680us                    -               -         -         -         -        4B  2.4328MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.17547s  50.267ms            (128 1 1)       (256 1 1)        32        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_18SampleNormalKernelIN7mshadow3gpuEEEJNS_6common6random13RandGeneratorIS5_fEEiijjPfSB_SB_EEEviDpT0_ [716]
   3.22654s  3.6994ms          (60000 1 1)       (256 1 1)        25        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_13slice_forwardILi5ELi1EN7mshadow3gpuEEEJPfS7_NS4_5ShapeILi5EEES9_NS_6common11StaticArrayIiLi5EEESC_EEEviDpT0_ [731]
   3.23131s  329.61ms            (938 2 1)        (8 32 1)       108  10.250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, float const *, kernel_convNd_params, int, float, float, int, float const *, float const *) [758]
   3.56288s  2.2720us                    -               -         -         -         -  2.0000KB  859.65MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.56324s  2.1430us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [775]
   3.56381s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.56415s  1.9840us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [791]
   3.56472s  2.0160us                    -               -         -         -         -  2.0000KB  968.81MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.56525s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.56639s  6.4328ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [827]
   3.57302s  1.8631ms           (7500 1 1)       (128 1 1)        20        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) [848]
   3.57518s  674.13ms            (938 2 1)        (8 32 1)       108  10.250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, float const *, kernel_convNd_params, int, float, float, int, float const *, float const *) [875]
   4.25052s  2.3040us                    -               -         -         -         -  2.0000KB  847.71MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.25068s  3.2000us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [890]
   4.25129s  2.0800us                    -               -         -         -         -  2.0000KB  939.00MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.25162s  1.9840us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [906]
   4.25218s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.25273s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.25329s  6.4278ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [940]
   4.26006s  13.323ms            (938 2 1)        (8 32 1)       108  10.250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, float const *, kernel_convNd_params, int, float, float, int, float const *, float const *) [972]
   4.27427s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.27440s  2.4320us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [985]
   4.27495s  1.9520us                    -               -         -         -         -  2.0000KB  0.9771GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.27528s  1.9840us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [1001]
   4.27584s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.27636s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.27751s  6.4224ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [1037]
   4.28411s  2.9166ms          (65535 1 1)       (256 1 1)        10        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_ [1054]
   4.28768s  1.8733ms           (7500 1 1)       (128 1 1)        20        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) [1070]
   4.28985s  672.74ms            (938 2 1)        (8 32 1)       108  10.250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, float const *, kernel_convNd_params, int, float, float, int, float const *, float const *) [1098]
   4.96379s  2.4000us                    -               -         -         -         -  2.0000KB  813.80MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.96393s  3.2960us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [1111]
   4.96454s  2.0480us                    -               -         -         -         -  2.0000KB  953.67MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.96489s  1.9840us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [1127]
   4.96545s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.96597s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.96653s  6.4416ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [1161]
   4.97314s  1.8597ms           (7500 1 1)       (128 1 1)        20        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) [1182]
   4.97525s  673.72ms            (938 2 1)        (8 32 1)       108  10.250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, cudnn::detail::implicit_convolveNd_sgemm<float, int=3, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, float const *, kernel_convNd_params, int, float, float, int, float const *, float const *) [1206]
   5.65040s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   5.65055s  2.5600us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [1221]
   5.65116s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   5.65149s  1.9520us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [1237]
   5.65205s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   5.65257s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   5.65322s  6.4391ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [1271]
   5.65984s  2.9161ms          (65535 1 1)       (256 1 1)        10        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_ [1288]
   5.66288s  1.8583ms           (7500 1 1)       (128 1 1)        20        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) [1302]
   
   
   
   ```
   
   
   With a = net(x[:, :, :, :11, :11])
   
   ```
   ==22076== Profiling application: python bb.py
   [01:27:32] src/engine/naive_engine.cc:55: Engine shutdown
   ==22076== Profiling result:
      Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
   2.70511s  2.3040us                    -               -         -         -         -      112B  46.359MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03518s  2.0160us                    -               -         -         -         -      112B  52.982MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03530s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03547s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03563s  1.5360us                    -               -         -         -         -      112B  69.539MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03622s  4.9280us                    -               -         -         -         -  2.3750KB  470.64MB/s      Device           -    Tesla K80 (0)         1        26  [CUDA memset]
   3.03625s  3.9360us                    -               -         -         -         -  2.3750KB  589.26MB/s      Device           -    Tesla K80 (0)         1        27  [CUDA memset]
   3.03626s  3.7440us                    -               -         -         -         -  2.3750KB  619.48MB/s      Device           -    Tesla K80 (0)         1        28  [CUDA memset]
   3.03628s  3.7120us                    -               -         -         -         -  2.3750KB  624.82MB/s      Device           -    Tesla K80 (0)         1        29  [CUDA memset]
   3.03714s  2.3360us                    -               -         -         -         -      112B  45.724MB/s    Pageable      Device    Tesla K80 (0)         1         7  [CUDA memcpy HtoD]
   3.03808s  1.8467ms                    -               -         -         -         -  13.500MB  7.1389GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.04068s  130.24us          (13824 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [594]
   3.10729s  3.6815ms                    -               -         -         -         -  27.000MB  7.1621GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.11170s  256.96us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [610]
   3.11457s  71.168us                    -               -         -         -         -  512.00KB  6.8610GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.11509s  6.9760us            (512 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [626]
   3.18066s  3.7273ms                    -               -         -         -         -  27.000MB  7.0742GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.18521s  256.99us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [642]
   3.24959s  3.6985ms                    -               -         -         -         -  27.000MB  7.1292GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.25406s  257.05us          (27648 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [658]
   3.25491s  79.680us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [667]
   3.25519s  82.495us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [676]
   3.25547s  81.023us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [685]
   3.25575s  82.688us            (128 1 1)       (256 1 1)        24        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  mxnet::common::random::rand_generator_seed_kernel(curandStatePhilox4_32_10*, int, unsigned int) [694]
   3.25662s  2.4640us                    -               -         -         -         -        4B  1.5482MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.25663s  1.5680us                    -               -         -         -         -        4B  2.4328MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   3.25667s  50.391ms            (128 1 1)       (256 1 1)        32        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_18SampleNormalKernelIN7mshadow3gpuEEEJNS_6common6random13RandGeneratorIS5_fEEiijjPfSB_SB_EEEviDpT0_ [716]
   3.30790s  4.4087ms          (65535 1 1)       (256 1 1)        25        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_13slice_forwardILi5ELi1EN7mshadow3gpuEEEJPfS7_NS4_5ShapeILi5EEES9_NS_6common11StaticArrayIiLi5EEESC_EEEviDpT0_ [731]
   3.32607s  42.809ms         (131072 1 1)       (16 16 1)        56  10.160KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void fft3d_r2c_16x16x16<float, float, float2>(float2*, float*, int3, int3, int3, int3, int3, bool) [766]
   3.36884s  25.372ms          (76800 1 1)       (16 16 1)        56  10.160KB        0B         -           -           -           -    Tesla K80 (0)         1        18  void fft3d_r2c_16x16x16<float, float, float2>(float2*, float*, int3, int3, int3, int3, int3, bool) [786]
   3.39416s  53.754ms          (2048 72 1)       (256 1 1)        46  8.1250KB        0B         -           -           -           -    Tesla K80 (0)         1        14  void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *) [774]
   3.44786s  31.851ms          (1200 72 1)       (256 1 1)        46  8.1250KB        0B         -           -           -           -    Tesla K80 (0)         1        18  void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *) [794]
   3.47972s  431.09ms           (8 5 2304)       (16 16 1)       127  8.1445KB        0B         -           -           -           -    Tesla K80 (0)         1        18  cgemm_strided_batched_sm35_ldg_tn_64x8x64x16x16 [816]
   3.91082s  122.35ms          (36 4800 1)       (256 1 1)        46  8.1250KB        0B         -           -           -           -    Tesla K80 (0)         1        18  void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *) [824]
   4.03318s  46.092ms         (153600 1 1)       (16 16 1)        59  10.160KB        0B         -           -           -           -    Tesla K80 (0)         1        18  void fft3d_c2r_16x16x16<float2, float, float>(float*, float2*, int3, int3, int3, int3, int3, float, float, bool, int, float*, float*) [840]
   4.09256s  2.6560us                    -               -         -         -         -  2.0000KB  735.36MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.09280s  2.5920us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [873]
   4.09326s  1.9840us                    -               -         -         -         -  2.0000KB  984.44MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.09347s  2.0160us              (2 1 1)       (256 1 1)         8        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_ [889]
   4.09391s  1.9830us                    -               -         -         -         -  2.0000KB  984.93MB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.09436s  1.9520us                    -               -         -         -         -  2.0000KB  0.9771GB/s    Pageable      Device    Tesla K80 (0)         1        14  [CUDA memcpy HtoD]
   4.09517s  7.2510ms            (512 1 1)       (256 1 1)        36        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>(int=1, mxnet::op::batchnorm::cuda::BatchNormalizationUpdateOutputInferenceKernel<float, float, mxnet::op::batchnorm::cuda::DeviceTensor<float, int=1>, mxnet::op::batchnorm::BNTensor3<float>>, float, int=1, int=1, int=1, int=1, int=1, float, unsigned int) [925]
   4.10261s  2.3114ms           (9600 1 1)       (128 1 1)        20        0B        0B         -           -           -           -    Tesla K80 (0)         1        14  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) [946]
   
   Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
   SSMem: Static shared memory allocated per CUDA block.
   DSMem: Dynamic shared memory allocated per CUDA block.
   SrcMemType: The type of source memory accessed by memory operation/copy
   DstMemType: The type of destination memory accessed by memory operation/copy
   ======== Error: Application returned non-zero code 1
   
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services