Posted to dev@tvm.apache.org by Giuseppe Rossini via Apache TVM Discuss <no...@discuss.tvm.ai> on 2020/09/10 17:04:12 UTC

[Apache TVM Discuss] [Development/RFC] [RFC] Accelerate quantized convolution through dot-product


## Motivation

In recent RFCs we successfully boosted convolution performance on native Armv8-A architectures. When using Armv8.2-A and above ISAs, developers are provided with a richer set of instructions, among which the dot-product instructions `udot` (and `sdot`) are particularly useful for Machine Learning applications (as a reference, see the [Neoverse optimization guide](https://static.docs.arm.com/swog309707/a/Arm_Neoverse_N1_Software_Optimization_Guide.pdf)).

## Basic udot/sdot functioning

The instruction

```udot v0.4s, v1.16b, v2.16b```

subdivides the registers `v1` and `v2` into blocks of 4 `uint8` elements and accumulates the dot-product of each pair of blocks into the corresponding 32-bit word of `v0`. You can see this operation depicted in the following picture:

![dot-product operation](https://confluence.arm.com/download/attachments/550375790/dotproduct.PNG?version=1&modificationDate=1599150323354&api=v2)
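As a reference, the semantics of `udot` can be modelled with a few lines of plain Python (an illustrative sketch, not TVM code):

```
# Scalar model of `udot v0.4s, v1.16b, v2.16b` (illustrative sketch).
# v1, v2: lists of 16 uint8 values; v0: list of 4 uint32 accumulators.
def udot(v0, v1, v2):
    for word in range(4):  # one 32-bit word of v0 per block of 4 bytes
        acc = sum(v1[4 * word + lane] * v2[4 * word + lane] for lane in range(4))
        v0[word] = (v0[word] + acc) & 0xFFFFFFFF  # udot accumulates into v0
    return v0
```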

A less known variant of this instruction is the indexed dot-product:

```udot v0.4s, v1.16b, v2.4b[0]```

This instruction takes the first 4 `uint8` elements of vector `v2` and computes their dot-product with each group of 4 elements from vector `v1`, accumulating the results into `v0`. This is depicted in the following picture:

![indexed dot-product operation](https://confluence.arm.com/download/attachments/550375790/indexed_dotprod.PNG?version=1&modificationDate=1599150523909&api=v2)
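The indexed form can be modelled in the same way (again, just an illustrative sketch):

```
# Scalar model of the indexed dot-product `udot v0.4s, v1.16b, v2.4b[idx]`.
def udot_indexed(v0, v1, v2, idx=0):
    selected = v2[4 * idx : 4 * idx + 4]  # the idx-th group of 4 bytes of v2
    for word in range(4):
        acc = sum(v1[4 * word + lane] * selected[lane] for lane in range(4))
        v0[word] = (v0[word] + acc) & 0xFFFFFFFF  # accumulate into v0
    return v0
```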

This last version is the one we will use throughout the remainder of this RFC.

## Implementation strategy

We decided to add dot-product support through two intrinsics and to exploit those intrinsics through tensorization. Unlike the previous intrinsic for Armv8-A (which was written in inline assembly), we were able to write these entirely with TIR/LLVM instructions. The main difference is that, given two tiles `tile_A` and `tile_B`, the output `tile_C` produced with the dot-product is partial but correct. In the case of Armv8-A, instead, we needed some additional assembly magic (i.e., `addp` instructions) to produce the correct partial tile.

### Strategy #1: 8x12 output tile, A interleaved and B transposed and interleaved

In this case the approach is very similar to the [Armv8-A RFC](https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920).

**Interleave A:**  We interleave (and pad if necessary) the rows of A in blocks of `8x4`. This means that each tile will contain 4 consecutive elements of 8 rows of A.
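As an illustration of the data layout (not the actual TVM packing code), the interleaving of `A` can be sketched with numpy, assuming `M` and `K` are already padded to multiples of 8 and 4:

```
import numpy as np

# Illustrative sketch: interleave A (M x K, already padded) into 8x4 tiles.
# Tile [mo, ko] holds rows 8*mo..8*mo+7 and columns 4*ko..4*ko+3 of A.
def interleave_A(A, tile_rows=8, tile_cols=4):
    M, K = A.shape
    assert M % tile_rows == 0 and K % tile_cols == 0, "pad A first"
    return (A.reshape(M // tile_rows, tile_rows, K // tile_cols, tile_cols)
             .transpose(0, 2, 1, 3))  # shape: (M//8, K//4, 8, 4)
```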

**Interleave and transpose B:**  We block-transpose `B` as in the [Armv8-A RFC](https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920). In this case, though, we use blocks of `12x4`. Each tile of the reshaped `B` will contain 4 consecutive elements of 12 columns of `B`.
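Similarly, a numpy sketch of the block transposition of `B` (here with the `12x4` tiles used by this strategy; `K` and `N` assumed already padded):

```
import numpy as np

# Illustrative sketch: block-transpose B (K x N, already padded) into 12x4 tiles.
# Tile [no, ko] holds 4 consecutive K-elements of 12 consecutive columns of B.
def interleave_transpose_B(B, tile_cols=12, tile_k=4):
    K, N = B.shape
    assert K % tile_k == 0 and N % tile_cols == 0, "pad B first"
    return (B.reshape(K // tile_k, tile_k, N // tile_cols, tile_cols)
             .transpose(2, 0, 3, 1))  # shape: (N//12, K//4, 12, 4)
```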

**Computation through dot-product:**  We use an `mmla4x4` intrinsic to produce a `4x4` (interleaved) output tile given `4x4` tiles from `A` and `B`. Please note that we unroll it by a factor of two in order to produce the correct `8x4` output tile.

This is the rule we are using:

```
vec_a = ins[0].vload([0, 0], dtype_vec)  # vec_a contains 4 rows of A (4 elements each)
vec_aa = [select_word(vec_a, i, dtype_vec) for i in range(0, 4)]  # select and broadcast the i-th row
vec_b = ins[1].vload([0, 0], dtype_vec)  # vec_b contains the 4 columns of B (4 elements each)

# Execute the matrix multiplication: each iteration produces one 4-element row of C
for i in range(0, 4):
    vec_c = outs[0].vload([i, 0], 'int32x4')
    vdot = tvm.tir.call_llvm_intrin(
        'int32x4',
        'llvm.aarch64.neon.sdot',
        tvm.tir.const(3, 'uint32'),
        vec_c, vec_b, vec_aa[i])

    # Store the result
    ib.emit(outs[0].vstore([i, 0], vdot))
```

We will give some more information about `select_word` later in this RFC.

### Strategy #2: 4x16 output tile, A native and B transposed and interleaved

This strategy is different from the one we previously adopted, and deserves some more explanation. 

**A is in native form:**  We don't interleave `A`, but we do pad it if necessary. Now the i-th load instruction loads 16 elements from the i-th row of `A`.

**Interleave and transpose B:**  For `B` nothing changes. We tile it in the same way as before, but with a different `16x4` tile shape. Each tile of the reshaped `B` will contain 4 consecutive elements of 16 columns of `B`.

**Computation through dot-product:**  We use an `mmla16x4` intrinsic. The inputs are an `Rx4` tile of `A` (`R` is the number of resulting rows) and a `16x4` tile of `B`. Before showing any code, we depict the tiled computation in the following picture, where `R` is set to 4. The idea is the following:

1. A single load reads 16 consecutive elements from matrix `A` (which is in its native form). 4 of them
are green, 4 of them are blue and so on
2. The first block of output row 0, `C[0, 0:4]`, is produced in the following way:
```
C[0, 0:4]  = A[0, 0:4]   * B_interleaved_t[0:4,   0:4]
C[0, 0:4] += A[0, 4:8]   * B_interleaved_t[4:8,   0:4]
C[0, 0:4] += A[0, 8:12]  * B_interleaved_t[8:12,  0:4]
C[0, 0:4] += A[0, 12:16] * B_interleaved_t[12:16, 0:4]
```
3. Repeat the same operation for each of the `R` rows of C

![mmla16x4 tiled computation](https://confluence.arm.com/download/attachments/550375790/mmla16x4%20%281%29.PNG?version=1&modificationDate=1599154955087&api=v2)

A few things worth underlining:
* In the picture we tried to render the algorithm with different colors: multiplications only happen between tiles of the same color
* The tiles of `B_interleaved_t` in the picture do not represent the real memory layout. Basically, tile `[0,0]` is stored by rows, followed by tiles `[1,0]`, `[2,0]`, `[3,0]`, `[0,1]`, etc. (this reinforces the fact that `B_interleaved_t` is a block-transposed version of `B`)
* Very importantly, **the output C is already in its native form, so we don't need to unpack it**
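To make the tiled computation concrete, here is a scalar numpy reference for `R = 4` (an illustrative sketch written in logical `(K, N)` coordinates; `B_interleaved_t` only changes the memory layout, not the values, so we index `B` directly):

```
import numpy as np

# Reference model of the mmla16x4 tiled computation (illustrative sketch).
# A: (R, 16) int8, B: (16, 16) int8 in logical (K, N) coordinates, C: (R, 16) int32.
def mmla16x4_ref(A, B, R=4):
    C = np.zeros((R, 16), dtype=np.int32)
    for r in range(R):
        for j in range(4):      # 4 blocks of 4 output columns
            for i in range(4):  # 4 K-blocks (the "colors" in the picture)
                a_blk = A[r, 4 * i : 4 * i + 4].astype(np.int32)
                b_blk = B[4 * i : 4 * i + 4, 4 * j : 4 * j + 4].astype(np.int32)
                C[r, 4 * j : 4 * j + 4] += a_blk @ b_blk  # one indexed dot-product per step
    return C

# Sanity check against a plain matrix multiplication:
# A = np.random.randint(-128, 127, (4, 16), dtype=np.int8)
# B = np.random.randint(-128, 127, (16, 16), dtype=np.int8)
# assert (mmla16x4_ref(A, B) == A.astype(np.int32) @ B.astype(np.int32)).all()
```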

For completeness we write down the tensorization node we use to implement the above tiled computation: 

```
for k in range(0, rows):
    vec_a = ins[0].vload([k, 0], dtype_vec)  # 16 elements of the k-th row of A

    for j in range(0, 4):      # 4 blocks of 4 output columns
        for i in range(0, 4):  # 4 blocks of 4 reduction elements
            vec_aa = select_word(vec_a, i, dtype_vec)
            vec_b = ins[1].vload([i, 4 * j, 0], dtype_vec)
            vec_c = outs[0].vload([k, 4 * j], 'int32x4')
            vdot = tvm.tir.call_llvm_intrin(
                'int32x4',
                'llvm.aarch64.neon.sdot',
                tvm.tir.const(3, 'uint32'),
                vec_c, vec_b, vec_aa)

            # Store the result
            ib.emit(outs[0].vstore([k, 4 * j], vdot))
```

### How to produce the correct indexed dot-product: the select_word() function

The indexed dot-product is not available as an LLVM intrinsic. It is instead produced by an LLVM IR optimization when we do:

```

# Reinterpret vec (16 x uint8) as 4 int32 words
vec_int32 = tvm.tir.call_intrin('int32x4', 'tir.reinterpret', vec)
# Broadcast the lane-th word to all 4 lanes
vec_int32_shuffled = tvm.tir.Shuffle([vec_int32], [lane, lane, lane, lane])
# Convert back to uint8x16
vec_int8_broadcast = tvm.tir.call_intrin(dtype_vec, 'tir.reinterpret', vec_int32_shuffled)

udot(vec_c, vec_b, vec_int8_broadcast)

```

The first three instructions are implemented in a utility function named `select_word` in `topi/arm_cpu/tensor_intrin.py`.
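Putting those three instructions together, `select_word` boils down to a small helper like the following (a sketch based on the snippet above; the exact code in `topi/arm_cpu/tensor_intrin.py` may differ in details):

```
import tvm

def select_word(vec, lane, dtype_vec):
    """Broadcast the lane-th group of 4 (u)int8 elements of `vec` across the whole vector."""
    # Reinterpret vec as 4 int32 words
    vec_int32 = tvm.tir.call_intrin('int32x4', 'tir.reinterpret', vec)
    # Broadcast the lane-th word to all 4 lanes
    vec_int32_shuffled = tvm.tir.Shuffle([vec_int32], [lane, lane, lane, lane])
    # Convert back to a 16 x (u)int8 vector
    return tvm.tir.call_intrin(dtype_vec, 'tir.reinterpret', vec_int32_shuffled)
```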

### Why implement both strategies?

If we run the math, we can see that the number of memory accesses in the interleaved approach is slightly smaller than in the hybrid approach. However, the hybrid kernels don't need the interleaving of the input data or the un-interleaving of the output. Since we try to fuse those transformations, it is not entirely clear which approach is best, so we let the tuner decide the winner.

## Performance improvements

In order to get an initial measurement of the performance improvements, we again consider `inception_v3` (which is a good benchmark, given its variety of shapes) running on a [Neoverse N1](https://developer.arm.com/ip-products/processors/neoverse/neoverse-n1) machine.

The results we measured are as follows:

* 2.41x improvement compared to the Armv8-A implementation
* About 5% slower than ArmNN (which uses ACL under the hood)

These are encouraging results, which is why we will submit this improvement as is before venturing into more exotic optimizations.

## Next steps

**Comparing performance across different networks**

While the results for `inception_v3` were satisfactory, we will compare performance against ArmNN for other networks as well, to understand whether there are big gaps that need to be addressed.

**Improving performance further: padding and fusion**

The hybrid strategy aims at avoiding memory-bound operations (like packing/unpacking) and gives us the possibility of fusing the requantization directly into the main computation. However, we ran into the following issues:

* Since we apply the `mmla16x4` intrinsic through tensorization, we need to pad `A` beforehand, which is itself a memory-bound operation and defeats the benefits of this approach. Simple approaches to remove the padding seem ineffective:
  * If we don't pad and run tensorize over variable-dimension tiles, it simply fails (see [this discuss post](https://discuss.tvm.ai/t/loop-partitioning-and-tensorization-work-on-different-ir-levels/876))
  * If we don't pad and run tensorize only over fixed-dimension tiles, `@tir.likely` statements appear, hurting performance.
* For the same reason, we cannot fuse the requantization into the computation. In addition to the inability to `compute_at` within tensorize, we are also blocked by the inability to `compute_at` on fused/split axes (as mentioned [in this post](https://discuss.tvm.apache.org/t/fuse-split-compute-at-issues/7862/))

We are currently working on a well-designed solution to address both issues. Possible solutions are still being evaluated, and every suggestion is welcome!





---

Posted by Giuseppe Rossini via Apache TVM Discuss <no...@discuss.tvm.ai>.

cc @anijain2305, @FrozenGene, @ramana-arm




