Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2019/11/08 02:43:09 UTC

[GitHub] [incubator-tvm] FrozenGene commented on issue #4277: [ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule.

FrozenGene commented on issue #4277: [ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule.
URL: https://github.com/apache/incubator-tvm/pull/4277#issuecomment-551359926
 
 
   > @jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over
   > and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.
   
   @ajtulloch Thanks for the interest, and for the great discussion we have had. :-)
   
   I want to summarize some of our high-level ideas here; I will present the full results at the next TVM meetup in Shanghai.
   
   For Convolution:
   1. We use the NHWC layout.
   2. Currently, we use Tensorize (a rough sketch of the mechanism is below).
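
   To illustrate what Tensorize does, here is a minimal sketch following the pattern of TVM's tensorize tutorial and the 2019-era `tvm.*` API. It is not our actual kernel; in particular the external microkernel symbol `dot_int16` is hypothetical:
   ```
   import tvm

   def intrin_dot_int16(n):
       """Declare an n-element INT16 dot product that tensorize can
       pattern-match and replace with a hand-written microkernel."""
       x = tvm.placeholder((n,), dtype="int16", name="x")
       y = tvm.placeholder((n,), dtype="int16", name="y")
       k = tvm.reduce_axis((0, n), name="k")
       z = tvm.compute((1,), lambda i: tvm.sum(
           x[k].astype("int32") * y[k].astype("int32"), axis=k), name="z")

       def intrin_func(ins, outs):
           xx, yy = ins
           zz = outs[0]
           ib = tvm.ir_builder.create()
           # Call out to a hand-written (e.g. NEON assembly) microkernel;
           # "dot_int16" is a hypothetical external symbol.
           ib.emit(tvm.call_extern("int32", "dot_int16",
                                   zz.access_ptr("w"),
                                   xx.access_ptr("r"),
                                   yy.access_ptr("r"), n))
           return ib.get()

       with tvm.build_config(offset_factor=1):
           return tvm.decl_tensor_intrin(z.op, intrin_func)

   # In the schedule, the matched inner loop is replaced with the intrinsic:
   #   s[conv].tensorize(inner_axis, intrin_dot_int16(16))
   ```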
   
   We studied QNNPACK, but we cannot use it directly: some of its concepts, for example the indirect buffer, we cannot reproduce in our schedules. So we wrote the kernels ourselves.
   
   For Depthwise Convolution:
   1. We use the NHWC layout.
   2. We don't use Tensorize.
   
   Yes. We use the INT16 * INT16 + INT32 -> INT32 instruction (SMLAL), which is better than INT32 * INT32 + INT32 -> INT32. The way we do it is to subtract the input_zero_point / kernel_zero_point before the computation; at that point we cast the dtype from UINT8 to INT16.
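
   For instance, a minimal sketch of this widening trick in TVM's compute language (2019-era `tvm.*` API; the zero-point values and all names are illustrative, not our actual code):
   ```
   import tvm

   N = 64
   a = tvm.placeholder((N,), dtype="uint8", name="a")
   b = tvm.placeholder((N,), dtype="uint8", name="b")
   a_zp = tvm.const(128, "int16")  # hypothetical input zero point
   b_zp = tvm.const(112, "int16")  # hypothetical kernel zero point

   # Widen UINT8 -> INT16 while subtracting the zero points up front...
   a16 = tvm.compute((N,), lambda i: a[i].astype("int16") - a_zp, name="a16")
   b16 = tvm.compute((N,), lambda i: b[i].astype("int16") - b_zp, name="b16")

   # ...so the multiply-accumulate is INT16 * INT16 accumulated into INT32,
   # which can map onto SMLAL / VMLAL.S16 on ARM.
   k = tvm.reduce_axis((0, N), name="k")
   dot = tvm.compute((1,), lambda i: tvm.sum(
       a16[k].astype("int32") * b16[k].astype("int32"), axis=k), name="dot")
   ```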
   
   For Depthwise Convolution, even though we don't use Tensorize, we still get better performance than QNNPACK (in MobileNet V1 / MobileNet V2, only 2 layers are slower than it; on all the others we are faster than QNNPACK). An amazing result. I want to list two key points:
   
   1. Avoid data packing. In im2col / spatial pack we pack data along H / W, which is costly for depthwise convolution; instead you can compute it directly and only split C, i.e. with shapes like this (a compute sketch using these shapes follows after this list):
   ```
       # Only the channel dimension is split (by the vector width VC);
       # H / W stay unpacked.  M is the depthwise channel multiplier.
       kvshape = (C // VC, M, KH, KW, VC)          # packed kernel
       oshape = (N, OH, OW, C)                     # NHWC output
       dvshape = (N, OH, OW, C // VC, KH, KW, VC)  # data view, split only on C
   ```
   2. compute_at is very important in depthwise convolution. `data_pad_inline` / `data_vec_inline` / `conv_inline` should be tunable; this is one important factor in beating QNNPACK (see the knob sketch after the compute sketch below).
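
   For point 1, this is roughly what such a direct depthwise compute could look like (2019-era `tvm.*` API, channel multiplier M assumed to be 1; an illustration, not our actual kernel):
   ```
   import tvm

   N, OH, OW, C, KH, KW, VC = 1, 56, 56, 64, 3, 3, 8

   # Placeholders laid out exactly as dvshape / kvshape above (M == 1).
   data_vec = tvm.placeholder((N, OH, OW, C // VC, KH, KW, VC),
                              dtype="int16", name="data_vec")
   kernel_vec = tvm.placeholder((C // VC, 1, KH, KW, VC),
                                dtype="int16", name="kernel_vec")

   kh = tvm.reduce_axis((0, KH), name="kh")
   kw = tvm.reduce_axis((0, KW), name="kw")

   # Direct NHWC depthwise compute: no packing along H / W, only C is
   # split by VC, and the INT16 products are accumulated in INT32.
   conv = tvm.compute(
       (N, OH, OW, C),
       lambda n, oh, ow, c: tvm.sum(
           data_vec[n, oh, ow, c // VC, kh, kw, c % VC].astype("int32")
           * kernel_vec[c // VC, 0, kh, kw, c % VC].astype("int32"),
           axis=[kh, kw]),
       name="depthwise_conv")
   ```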
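   For point 2, the decisions can be exposed as AutoTVM knobs, roughly like this (a sketch: `cfg` is the AutoTVM config inside a schedule template, and the stage / axis names are illustrative):
   ```
   def apply_inline_knobs(cfg, s, data_pad, data_vec, conv, output, oh, ow):
       """Make the compute_at / inline decisions tunable. The stages
       (data_pad, data_vec, conv, output) and axes (oh, ow) are assumed
       to come from the enclosing schedule template."""
       # Each knob picks where a stage is materialized:
       # 0 -> compute_at the output's h axis, 1 -> its w axis, 2 -> inline.
       cfg.define_knob("data_pad_inline", [0, 1, 2])
       cfg.define_knob("data_vec_inline", [0, 1, 2])
       cfg.define_knob("conv_inline", [0, 1, 2])

       for stage, knob in [(data_pad, "data_pad_inline"),
                           (data_vec, "data_vec_inline"),
                           (conv, "conv_inline")]:
           if cfg[knob].val == 2:
               s[stage].compute_inline()
           elif cfg[knob].val == 1:
               s[stage].compute_at(s[output], ow)
           else:
               s[stage].compute_at(s[output], oh)
   ```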
   
   Currently we have tested MobileNet V2 on Raspberry Pi; we are 1.34X faster than QNNPACK. On our in-house model the gap over QNNPACK is even larger. We will present more at the TVM meetup.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services