Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/06/09 16:45:30 UTC

[GitHub] [incubator-tvm] giuseros opened a new pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

giuseros opened a new pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754


   ### RFC
   This PR is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
   
   ### High level description of the submission
   
   The main algorithm lives in:
   * topi/python/topi/arm_cpu/conv2d_gemm.py (schedule)
   * topi/python/topi/arm_cpu/tensor_intrin.py (assembly + intrinsics)
   * topi/python/topi/arm_cpu/conv2d_int8.py (driver)
   
   The weight transform touches several files (since it is computed at compile time):
   * topi/python/topi/arm_cpu/conv2d_alter_op.py
   * python/tvm/relay/op/nn/_nn.py
   * python/tvm/relay/op/nn/nn.py
   * src/relay/op/nn/convolution.h (relay node definition)
   * src/relay/op/nn/convolution.cc (relay node definition)
   * include/tvm/relay/attrs/nn.h (relay node definition)
   
   Strategies (mapping relay-node -> compute+schedules) are defined here:
   * python/tvm/relay/op/strategy/arm_cpu.py
   * python/tvm/relay/op/strategy/generic.py





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642690898


   > Hi @FrozenGene ,
   > I agree that different strategies should be available to the auto-tuner. See if the solution proposed is good enough for you (at least as a temporary work-around). For Armv7-A or NCHW, nothing changes, we follow exactly the previous path.
   > 
   > For Armv8-A and NHWC we don't convert during the legalization step, but during the `_alter_conv2d_layout` pass. The only difference now is that the offset contribution will be added after the convolution instead of before.
   > 
   > I agree that a solution where the legalization changes depending on the strategy would be better. However, I don't think the legalization step has enough information to know the strategy (for now).
   > 
   > What do you think?
   
   I think it is ok.





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440711660



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -2134,6 +2202,25 @@ def contrib_conv2d_winograd_weight_transform(weight,
     return _make.contrib_conv2d_winograd_weight_transform(weight, tile_size)
 
 
+def contrib_conv2d_gemm_weight_transform(weights):

Review comment:
       I added the assert in `conv2d_alter_op.py` as done for Winograd. Also, I am asserting the (input and kernel) layouts in [Conv2dGemmRel](https://github.com/apache/incubator-tvm/pull/5754/files#diff-d16c1ec50ceaac0e06f963a425d325b4R568), again following the approach used for Winograd.







[GitHub] [incubator-tvm] anijain2305 edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-644902827


   @FrozenGene Can you please review when you get time?





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642564985


   Hi @FrozenGene 
   Just to clarify: I am enjoying the discussion, and since the optimization space is wild, I agree that it is worth evaluating different approaches.
   * About the Raspberry+mobilenet v2, good to know you are working on Armv8-A (sorry to have assumed otherwise). However, there is still the point that mobilenet uses shallow convolutions, while I am addressing deeper and more generic convolutions. 
   * Are you saying that, as things stand now in TVM, the `conv2d_nhwc_spatial_pack` schedule might be faster than the gemm approach on smaller CPUs? Unfortunately, for now I don't think they can be added together because of what I said above about the legalization step. Do you know any work-around to that? Maybe I can legalize only for specific devices (e.g., only for Cortex-A55)? 
   * Finally, as things stand now we might get this PR in, and later do a more detailed comparison across different networks + CPUs





[GitHub] [incubator-tvm] anijain2305 edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-643422726


   @FrozenGene @giuseros If QNN legalization is causing issues, we can remove QNN legalization for ARM CPUs altogether and move the logic to alter op layout. Alter op layout might become more complicated (we might have to handle uint8 x int8 input and kernel dtypes there now). Just an idea, in case consolidating things in one place makes life easier.





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440044546



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -1976,6 +1976,74 @@ def contrib_conv2d_winograd_without_weight_transform(data,
         kernel_layout, out_layout, out_dtype)
 
 
+def contrib_conv2d_gemm_without_weight_transform(data,
+                                                 weight,
+                                                 strides=(1, 1),
+                                                 padding=(0, 0),
+                                                 dilation=(1, 1),
+                                                 groups=1,
+                                                 channels=None,
+                                                 kernel_size=None,
+                                                 data_layout="NCHW",
+                                                 kernel_layout="OIHW",
+                                                 out_layout="",
+                                                 out_dtype=""):
+    r"""2D convolution with gemm algorithm.

Review comment:
       All the other help texts are represented as raw strings (i.e., starting with an "r"). I think this is because raw strings leave backslashes in the string, and the string can be parsed later by help formatter tools. In this particular case, I don't need a raw string, since I don't have any backslashes in the help text, so if you wish I can remove it.
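    
    A tiny illustration of the difference (mine, not from the PR): raw strings keep the backslash for downstream help formatters, while normal strings may collapse escape sequences.
    ```python
    assert len(r"\alpha") == 6  # backslash preserved: \, a, l, p, h, a
    assert len("\alpha") == 5   # "\a" collapses to the single BEL character
    ```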







[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642374967


   Thanks for the great work! I have some quick questions:
   1. Have you tested various models on different ARM CPUs (like the A53, A72, A55, A75, and so on)? According to the FB QNNPACK blog, `umul` does not always get the best performance compared with the `smlal` instruction (the one used today) (https://engineering.fb.com/ml-applications/qnnpack/). So just changing the legalization and giving up the `smlal` instruction on AArch64 may not make sense to me. One proof: our coming feature `Ansor` (the auto scheduler) doesn't support tensorize (at least not yet), yet it gets nice performance using the `smlal` instruction and beats TFLite by 1.2X on the quantized MobileNet V2 model (Cortex-A53) (https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4). I mean here:
   ```python
   @qnn_conv2d_legalize.register('arm_cpu')
   def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
       # ARM prefers the dtypes to be same.
       if is_aarch64_arm():
           return helper_change_dtypes_to_be_same(attrs, inputs, types, relay.qnn.op.conv2d)
       return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.qnn.op.conv2d)
   ```
   It prevents us from using the `SMLAL` instruction.
   
   2. I suggest we keep two schedules (tensorize and the default spatial pack), not just check for `aarch64` and use only the tensorize template. I mean here:
   ```python
   is_aarch64 = "aarch64" in str(isa.target)
   if is_aarch64 and data.dtype in ["int8", "uint8"]:
       strategy.add_implementation(
           wrap_compute_conv2d(topi.arm_cpu.compute_conv2d_NHWC_quantized),
           wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_NHWC_quantized),
           name="compute_conv2d_NHWC_quantized.arm_cpu")
   else:
       strategy.add_implementation(
           wrap_compute_conv2d(topi.arm_cpu.conv2d_nhwc_spatial_pack),
           wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nhwc_spatial_pack),
           name="conv2d_nhwc_spatial_pack.arm_cpu")
   ```
   
   This is the design purpose of strategies. I suspect there are some workloads where our spatial pack could perform better. The situation is the same as for Winograd: we can run both the Winograd and the default template and choose the better one.
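   
   As a concrete sketch of point 2 (my adaptation, assuming AutoTVM's `plevel` tie-breaking; this is not code from the PR), both implementations could be registered and left to the tuner:
   ```python
   # Hypothetical: register both implementations; AutoTVM benchmarks and picks.
   if is_aarch64 and data.dtype in ["int8", "uint8"]:
       strategy.add_implementation(
           wrap_compute_conv2d(topi.arm_cpu.compute_conv2d_NHWC_quantized),
           wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_NHWC_quantized),
           name="compute_conv2d_NHWC_quantized.arm_cpu",
           plevel=15)  # preferred, but the spatial pack below stays available
   strategy.add_implementation(
       wrap_compute_conv2d(topi.arm_cpu.conv2d_nhwc_spatial_pack),
       wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nhwc_spatial_pack),
       name="conv2d_nhwc_spatial_pack.arm_cpu")
   ```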





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-643286046


   > Hi @FrozenGene ,
   > I gave it another go, but switching the legalization based on the strategy seems very hard (since we would need the auto-tuner to pick the best data-type for us).
   > 
   > So for now, we have to be content with the `_alter_conv2d_layout` workaround and try to think a bit more about how we can infer the strategy during legalization
   
   I think I could accept this way.





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642607696


   So I mean to add a `convert_data_type` pass that is similar to `alter_op_layout` but converts datatypes (so we can do something like `if topi_impl == 'spatial_nhwc': convert to int16`).
   
   This doesn't seem possible directly in `alter_op_layout`, because only the shapes are passed to that function, but I will play with it a bit more
   
   To reply to your question, yes, the `alter_op_layout` pass is executed when the autotuner runs
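   
   A bare skeleton of what such a `convert_data_type` pass might look like (entirely illustrative; the class name is made up, and the strategy lookup is exactly the unsolved part):
   ```python
   import tvm
   from tvm import relay

   @relay.transform.function_pass(opt_level=0)
   class ConvertDataType:
       """Hypothetical pass: rewrite operand dtypes once the chosen
       topi implementation is known (that lookup is the open problem)."""
       def transform_function(self, func, mod, ctx):
           class Rewriter(relay.ExprMutator):
               def visit_call(self, call):
                   call = super().visit_call(call)
                   # e.g. if the chosen impl were 'spatial_nhwc', cast the
                   # conv2d operands to int16 here and rebuild the call.
                   return call
           return Rewriter().visit(func)
   ```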





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r443699271



##########
File path: python/tvm/relay/qnn/op/legalizations.py
##########
@@ -237,17 +237,23 @@ def is_fast_int8_on_arm():
     target = tvm.target.Target.current(allow_none=False)
     return '+v8.2a,+dotprod' in ' '.join(target.options)
 
+def is_aarch64_arm():
+    """ Checks whether we are compiling for an AArch64 target. """
+    target = tvm.target.Target.current(allow_none=False)
+    return 'aarch64' in ' '.join(target.options)
+
 ########################
 # ARM CPU legalizations.
 ########################
 
 @qnn_conv2d_legalize.register('arm_cpu')
 def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
     # ARM prefers the dtypes to be same.
-    if is_fast_int8_on_arm():
+    if (is_aarch64_arm() and attrs["data_layout"] == "NHWC") or is_fast_int8_on_arm():

Review comment:
       Well, that condition is there so that all AArch32 and NCHW cases follow the original path (convert to int16, subtract the offset contributions, and then convolve). I would prefer to keep it this way to minimize the impact on the code base.







[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642588670


   Hi @FrozenGene ,
   The idea of adding the algorithm name to the attributes would work if the legalization step were run after we pick the strategy. It is instead run before, so it is unaware of the strategy picked.
   
   Maybe we could add a new pass that runs based on the strategy? Or we could hack it into `_alter_conv2d_layout`?
   





[GitHub] [incubator-tvm] anijain2305 commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-647622695


   I pushed an empty commit to retrigger the CI - https://coderwall.com/p/vkdekq/git-commit-allow-empty





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642543391


   Hi @FrozenGene ,
   About the code changes:
   1) It will be hard to do this. The point is that the legalization is done in Relay before picking the strategy (thus, it is unaware of the strategy picked). To keep both legalizations I need to somehow pass information from the strategy (e.g., the name of the algorithm, or something like that). Are you aware of any other way I can do it?
    
   2) Note that I am targeting the NHWC layout. I wasn't able to even compile with `conv2d_nhwc_spatial_pack` for uint8 (it just hangs, at least when I tried it without auto-tuning on Armv8-A). I gathered from various discussions that NHWC support for ARM targets is incomplete at the moment. So for now we might simply agree to leave this as the default for NHWC and `conv2d_nchw_spatial_pack` as the default for NCHW, and mirror that in the legalization step, which might look like:
   
   ```python
   def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
       if is_aarch64_arm() and attrs.data_layout == "NHWC":
           return helper_change_dtypes_to_be_same(attrs, inputs, types, relay.qnn.op.conv2d)
       return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.qnn.op.conv2d)
   ```
   
   In subsequent work, we can find a way to pick the correct legalization after we have picked the strategy.





[GitHub] [incubator-tvm] anijain2305 commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-644902827


   @FrozenGene Can you please take a look when you get time?





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-641333161


   CC: @u99127 @anijain2305 





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-647914319


   Thanks @giuseros @anijain2305. Merged now.





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440719374



##########
File path: topi/python/topi/arm_cpu/conv2d_alter_op.py
##########
@@ -235,5 +239,37 @@ def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):
              new_attrs['out_layout'], out_dtype], topi_tmpl)
         dispatch_ctx.update(target, new_workload, cfg)
         return relay.nn.contrib_depthwise_conv2d_nchwc(*inputs, **new_attrs)
+    if topi_tmpl == "conv2d_NHWC_quantized.arm_cpu":
+        assert (data.dtype == 'int8' and kernel.dtype == 'int8' or
+                data.dtype == 'uint8' and kernel.dtype == 'uint8')
+        CO, IC, KH, KW = get_const_tuple(kernel.shape)
+
+        K = KH * KW * IC
+        N = CO
+
+        pad_k = 0
+        pad_n = 0
+
+        if N % 4 != 0:
+            pad_n = 4 - (N % 4)
+
+        if K % 16 != 0:
+            pad_k = 16 - (K % 16)
+
+        N_padded = N + pad_n
+        K_padded = K + pad_k
+
+        kernel_expr = relay.nn.contrib_conv2d_gemm_weight_transform(inputs[1])

Review comment:
       As I commented above, I added a few asserts in `conv2d_alter_op.py`, and some were already there in `Conv2dGemmRel`. Is this ok? If you prefer, I can change the name as you suggested, but I feel we might lose generality by doing so.
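    
    As a side note, a worked instance of the padding arithmetic in the diff above, with made-up sizes:
    ```python
    def round_up(x, multiple):
        """Pad x up to the next multiple (adds 0 if already aligned)."""
        return x + (multiple - x % multiple) % multiple

    # e.g. CO = 10 output channels, K = KH*KW*IC = 3*3*17 = 153
    assert round_up(10, 4) == 12     # pad_n = 2
    assert round_up(153, 16) == 160  # pad_k = 7
    ```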







[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642375802


   cc @ajtulloch 





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642577581


   > 1. It will be hard to do this. The point is that the legalization is done in Relay before picking the strategy (thus, it is unaware of the strategy picked). To keep both legalizations I need to somehow pass information from the strategy (e.g., the name of the algorithm, or something like that). Are you aware of any other way I can do it?
   
   @giuseros I think adding the algorithm name could be one way to handle it. For example, we could add it to the `attrs` and query it in the legalization pass; then we could drop it safely afterwards.
   
   > Note that I am targeting the NHWC layout. I wasn't able to even compile with conv2d_nhwc_spatial_pack for uint8 (it just hangs, at least when I tried it without auto-tuning on Armv8-A). I gathered from various discussions that NHWC support for ARM targets is incomplete at the moment. So for now we might simply agree to leave this as the default for NHWC and conv2d_nchw_spatial_pack as the default for NCHW, and mirror that in the legalization step, which might look like:
   ```python
   if is_aarch64_arm() and attrs.data_layout == "NHWC":
       return helper_change_dtypes_to_be_same(attrs, inputs, types, relay.qnn.op.conv2d)
   return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.qnn.op.conv2d)
   ```
   
   Yes, our NHWC schedule on ARM CPU isn't complete. In our careful testing, NHWC also performs better than NCHW on ARM CPU when using Ansor (aka the auto scheduler), which suggests we could improve our AutoTVM NHWC schedule on ARM CPU too. As the results I showed in the post indicate, the auto scheduler leverages the NHWC layout and the `smlal` instruction, so I would prefer to leverage the `attr[algorithm_name]` mentioned previously to keep the `smlal` instruction. After the auto scheduler is released (we are working hard on it and hope to bring it in after two weeks), we could see how to improve it (like generating smlal and smlal2, or your tensorize instruction); the two are orthogonal, but they share the same legalization pass.
   
   Some background on the auto scheduler: it only needs the tvm.compute definition and then generates the schedule automatically, so we can try NHWC / NCHW easily. In fact, there is no spatial-pack schedule-template concept in the auto scheduler world.
   
   > About the Raspberry+mobilenet v2, good to know you are working on Armv8-A (sorry to have assumed otherwise). However, there is still the point that mobilenet uses shallow convolutions, while I am addressing deeper and more generic convolutions.
   
   So we had better keep both algorithms, right?
   
   
   > Are you saying that, as things stand now in TVM, the conv2d_nhwc_spatial_pack schedule might be faster than the gemm approach on smaller CPUs? Unfortunately, for now I don't think they can be added together because of what I said above about the legalization step. Do you know any work-around to that? Maybe I can legalize only for specific devices (e.g., only for Cortex-A55)?
   
   I think adding the algorithm name mentioned before could help solve it.
   
   > Finally, as things stand now we might get this PR in, and later do a more detailed comparison across different networks + CPUs
   
   OK, I buy it. Once the legalization-pass issue we discussed is solved, I am glad to review the code carefully and handle this PR.





[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r443664602



##########
File path: python/tvm/relay/qnn/op/legalizations.py
##########
@@ -237,17 +237,23 @@ def is_fast_int8_on_arm():
     target = tvm.target.Target.current(allow_none=False)
     return '+v8.2a,+dotprod' in ' '.join(target.options)
 
+def is_aarch64_arm():
+    """ Checks whether we are compiling for an AArch64 target. """
+    target = tvm.target.Target.current(allow_none=False)
+    return 'aarch64' in ' '.join(target.options)
+
 ########################
 # ARM CPU legalizations.
 ########################
 
 @qnn_conv2d_legalize.register('arm_cpu')
 def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
     # ARM prefers the dtypes to be same.
-    if is_fast_int8_on_arm():
+    if (is_aarch64_arm() and attrs["data_layout"] == "NHWC") or is_fast_int8_on_arm():

Review comment:
       Do we still need this, given that we have moved the logic into the alter op layout pass?







[GitHub] [incubator-tvm] FrozenGene merged pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene merged pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754


   





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-645406262


   > @FrozenGene Can you please review when you get time?
   
   Yep. I could review it tomorrow.





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440052506



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -2134,6 +2202,25 @@ def contrib_conv2d_winograd_weight_transform(weight,
     return _make.contrib_conv2d_winograd_weight_transform(weight, tile_size)
 
 
+def contrib_conv2d_gemm_weight_transform(weights):

Review comment:
       It doesn't need the layout (for now), but it only supports NHWC (which it assumes). Should I assert if the wrong layout is passed? 
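    
    Concretely, something like this would do (hypothetical; the transform has no layout parameter today, so the argument below is an assumption):
    ```python
    def contrib_conv2d_gemm_weight_transform(weights, kernel_layout="HWIO"):
        # Hypothetical guard: the transform currently assumes HWIO.
        assert kernel_layout == "HWIO", \
            "gemm weight transform only supports HWIO"
        ...
    ```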







[GitHub] [incubator-tvm] anijain2305 commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-643422726


   @FrozenGene @giuseros If QNN legalization is causing issues, we can remove QNN legalization for ARM CPUs altogether and move the logic to alter op layout. Alter op layout might become more complicated (we might have to handle uint8 x int8 input and kernel dtypes there now). Just an idea, in case consolidating things in one place makes life easier.





[GitHub] [incubator-tvm] FrozenGene edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642541198


   @giuseros Glad to see we share the same thought that we should let AutoTVM select the best.
   
   The auto scheduler relies on the legalization pass to generate the smlal instruction (after the auto scheduler is released, let us make it better together). One piece of information I missed before: my testing Raspberry Pi 3B+ runs 64-bit Ubuntu, not 32-bit, so the target is AArch64 too.
   
   I mention the auto scheduler not to question your work (your work is very great!), and it is orthogonal as you said. I just mention that using the smlal instruction on the A53 (with the AArch64 OS mentioned before) we could get nice performance too. So I want to know whether, on low-end ARM CPUs, smlal is better than this approach (as the FB QNNPACK blog said: "The default microkernel uses the fewest possible instructions and thus delivers the best performance on low-end cores, which can execute only one NEON instruction per cycle.").
   
   So I wish we could test several ARM CPUs to prove this work performs well on all AArch64 cores (low-end and high-end).
   
   Secondly, I suggest we test MobileNet V2 too, to see whether this PR works well across various models.
   
   Your work is very great, but I wish us to use more data and results to make it more convincing.
   





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-646613001


   Hi @FrozenGene , 
   Thanks for the review!
   I applied your changes, but I get a (seemingly) unrelated test failure. 
   
   Could you please double-check, and let me know whether it has anything to do with my changes?
   
   Thanks





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642671388


   @giuseros It suddenly occurred to me that the auto scheduler will have an environment variable, so the change to the legalization won't affect the auto scheduler: we could check this environment variable for the auto scheduler and use `smlal`. However, I still think we should resolve the underlying problem of allowing different strategies to have different legalization logic.





[GitHub] [incubator-tvm] anijain2305 commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r441019086



##########
File path: topi/python/topi/arm_cpu/conv2d_alter_op.py
##########
@@ -235,5 +239,37 @@ def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):
              new_attrs['out_layout'], out_dtype], topi_tmpl)
         dispatch_ctx.update(target, new_workload, cfg)
         return relay.nn.contrib_depthwise_conv2d_nchwc(*inputs, **new_attrs)
+    if topi_tmpl == "conv2d_NHWC_quantized.arm_cpu":
+        assert (data.dtype == 'int8' and kernel.dtype == 'int8' or
+                data.dtype == 'uint8' and kernel.dtype == 'uint8')
+        CO, IC, KH, KW = get_const_tuple(kernel.shape)
+
+        K = KH * KW * IC
+        N = CO
+
+        pad_k = 0
+        pad_n = 0
+
+        if N % 4 != 0:
+            pad_n = 4 - (N % 4)
+
+        if K % 16 != 0:
+            pad_k = 16 - (K % 16)
+
+        N_padded = N + pad_n
+        K_padded = K + pad_k
+
+        kernel_expr = relay.nn.contrib_conv2d_gemm_weight_transform(inputs[1])

Review comment:
       I think this is good. Thanks :)







[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642541198


   Glad to see we share the same thought that we should let AutoTVM select the best.
   
   The auto scheduler relies on the legalization pass to generate the smlal instruction (after the auto scheduler is released, let us make it better together). One piece of information I missed before: my testing Raspberry Pi 3B+ runs 64-bit Ubuntu, not 32-bit, so the target is AArch64 too.
   
   I mention the auto scheduler not to question your work (your work is very great!), and it is orthogonal as you said. I just mention that using the smlal instruction on the A53 (with the AArch64 OS mentioned before) we could get nice performance too. So I want to know whether, on low-end ARM CPUs, smlal is better than this approach (as the FB QNNPACK blog said: "The default microkernel uses the fewest possible instructions and thus delivers the best performance on low-end cores, which can execute only one NEON instruction per cycle.").
   
   So I wish we could test several ARM CPUs to prove this work performs well on all AArch64 cores (low-end and high-end).
   
   Secondly, I suggest we test MobileNet V2 too, to see whether this PR works well across various models.
   
   Your work is very great, but I wish us to use more data and results to make it more convincing.
   





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-647615333


   > Hi @FrozenGene , @anijain2305 ,
   > Any update on this review?
   > Also, is there a way to retrigger the tests? Or should I contact someone in particular?
   > 
   > Thanks
   
   For the CI, maybe you could force-trigger it, or comment on the PR (and contact @jroesch) to explain the reason?





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-647889871


   @anijain2305 could you take another look?





[GitHub] [incubator-tvm] FrozenGene edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642374967


   Thanks for the great work! I have some quick questions:
   1. Have you tested various models on different ARM CPUs (like the A53, A72, A55, A75, and so on)? According to the FB QNNPACK blog, `umul` / `uadalp` do not always get the best performance compared with the `smlal` instruction (the one used today) (https://engineering.fb.com/ml-applications/qnnpack/). So just changing the legalization and giving up the `smlal` instruction on AArch64 may not make sense to me. One proof: our coming feature `Ansor` (the auto scheduler) doesn't support tensorize (at least not yet), yet it gets nice performance using the `smlal` instruction and beats TFLite by 1.2X on the quantized MobileNet V2 model (Cortex-A53) (https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4). I mean here:
   ```python
   @qnn_conv2d_legalize.register('arm_cpu')
   def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
       # ARM prefers the dtypes to be same.
       if is_aarch64_arm():
           return helper_change_dtypes_to_be_same(attrs, inputs, types, relay.qnn.op.conv2d)
       return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.qnn.op.conv2d)
   ```
   It prevents us from using the `SMLAL` instruction.
   
   2. I suggest we keep two schedules (tensorize and the default spatial pack), not just check for `aarch64` and use only the tensorize template. I mean here:
   ```python
   is_aarch64 = "aarch64" in str(isa.target)
   if is_aarch64 and data.dtype in ["int8", "uint8"]:
       strategy.add_implementation(
           wrap_compute_conv2d(topi.arm_cpu.compute_conv2d_NHWC_quantized),
           wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_NHWC_quantized),
           name="compute_conv2d_NHWC_quantized.arm_cpu")
   else:
       strategy.add_implementation(
           wrap_compute_conv2d(topi.arm_cpu.conv2d_nhwc_spatial_pack),
           wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nhwc_spatial_pack),
           name="conv2d_nhwc_spatial_pack.arm_cpu")
   ```
   
   This is the design purpose of strategies. I suspect there are some workloads where our spatial pack could perform better. The situation is the same as for Winograd: we can run both the Winograd and the default template and choose the better one.
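   
   For context, a minimal TE sketch of the int16 pattern that the `smlal` path relies on (my illustration, assuming int16 operands with an int32 accumulator; not code from the PR):
   ```python
   import tvm
   from tvm import te

   # int16 operands accumulated into int32: this reduction shape is what
   # maps naturally onto widening multiply-accumulate (smlal-style) NEON code.
   n = 1024
   a = te.placeholder((n,), dtype="int16", name="a")
   w = te.placeholder((n,), dtype="int16", name="w")
   k = te.reduce_axis((0, n), name="k")
   acc = te.compute((1,),
                    lambda i: te.sum(a[k].astype("int32") * w[k].astype("int32"),
                                     axis=k),
                    name="acc")
   ```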





[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642682834


   Hi @FrozenGene , 
   I agree that different strategies should be available to the auto-tuner. See if the solution proposed is good enough for you (at least as a temporary work-around). For Armv7-A or NCHW, nothing changes, we follow exactly the previous path. 
   
   For Armv8-A and NHWC we don't convert during the legalization step, but during the `_alter_conv2d_layout` pass. The only difference now is that the offset contribution will be added after the convolution instead of before.
   
   I agree that a solution where the legalization changes depending on the strategy would be better. However, I don't think the legalization step has enough information to know the strategy (for now).
   
   What do you think?
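   
   For the record, a toy check of the arithmetic identity that makes the two orderings equivalent (my illustration, not code from the PR):
   ```python
   import numpy as np

   rng = np.random.default_rng(0)
   a = rng.integers(0, 256, 8)   # quantized activations
   w = rng.integers(0, 256, 8)   # quantized weights
   z_a, z_w = 3, 5               # zero points

   offsets_before = np.dot(a - z_a, w - z_w)        # legalization path
   offsets_after = (np.dot(a, w)                    # raw int8 product first,
                    - z_w * a.sum() - z_a * w.sum() # then the offset
                    + a.size * z_a * z_w)           # contributions
   assert offsets_before == offsets_after
   ```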





[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642601817


   > Hi @FrozenGene ,
   > 
   > The idea of adding the algorithm name to the attributes would work if the legalization step were run after we pick the strategy. It is instead run before, so it is unaware of the strategy picked.
   > 
   > 
   > 
   > Maybe we could add a new pass that runs based on the strategy? Or we could hack it into `_alter_conv2d_layout`?
   > 
   > 
   
   @giuseros What do you mean by "runs based on the strategy"?
   
   In alter_op_layout, we could extract workload[0] to get the strategy. However, could you help me double-check whether our AutoTVM tuning uses the alter_op_layout pass (i.e. O3)? I have forgotten a little bit. If so, maybe we could change the dtype there according to the strategy. cc @anijain2305 for any better ideas.
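   
   Roughly what I mean by extracting it, following the pattern already used at the top of `_alter_conv2d_layout` (simplified sketch, not tested):
   ```python
   # The dispatch context records the chosen implementation's workload;
   # its first entry names the topi template.
   workload = autotvm.task.get_workload(outs)
   if workload is not None:
       topi_tmpl = workload[0]  # e.g. "conv2d_NHWC_quantized.arm_cpu"
       if topi_tmpl == "conv2d_nhwc_spatial_pack.arm_cpu":
           pass  # the int16 / smlal-friendly path could be chosen here
   ```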
   





[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440055355



##########
File path: topi/python/topi/arm_cpu/conv2d_alter_op.py
##########
@@ -235,5 +239,37 @@ def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):
              new_attrs['out_layout'], out_dtype], topi_tmpl)
         dispatch_ctx.update(target, new_workload, cfg)
         return relay.nn.contrib_depthwise_conv2d_nchwc(*inputs, **new_attrs)
+    if topi_tmpl == "conv2d_NHWC_quantized.arm_cpu":
+        assert (data.dtype == 'int8' and kernel.dtype == 'int8' or
+                data.dtype == 'uint8' and kernel.dtype == 'uint8')
+        CO, IC, KH, KW = get_const_tuple(kernel.shape)
+
+        K = KH * KW * IC
+        N = CO
+
+        pad_k = 0
+        pad_n = 0
+
+        if N % 4 != 0:
+            pad_n = 4 - (N % 4)
+
+        if K % 16 != 0:
+            pad_k = 16 - (K % 16)
+
+        N_padded = N + pad_n
+        K_padded = K + pad_k
+
+        kernel_expr = relay.nn.contrib_conv2d_gemm_weight_transform(inputs[1])

Review comment:
       I wasn't able to find any. The idea is similar to Winograd, but winograd uses tiling, while here I am [block transposing and interleaving](https://discuss.tvm.ai/uploads/default/original/2X/6/635d11f9639f849945788694f64346ca2bdcd461.png). 
   
   We would need the kernel layout only to assert against the wrong one. So yes, if you think it could be useful, I can do that. 
   
    As for changing the name to `contrib_conv2d_gemm_hwio_weight_transform`, I can easily implement that, but it means that if we want to support NCHW in the future we will have to change it again (and I think this is why they didn't do that for Winograd).
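    
    To make the interleaving concrete, a NumPy sketch of the transform as I read it (assuming the HWIO kernel is first flattened to a (K, N) matrix; illustrative, not the PR code):
    ```python
    import numpy as np

    def gemm_pack_weights(w_kn, k_block=16, n_block=4):
        """Pad K up to a multiple of 16 and N up to a multiple of 4, then
        block-transpose into (N//4, K//16, 4, 16), matching the
        [n_outer, k_outer, n_inner, k_inner] indexing of B_interleaved_t."""
        K, N = w_kn.shape
        pad_k = (k_block - K % k_block) % k_block
        pad_n = (n_block - N % n_block) % n_block
        w = np.pad(w_kn, ((0, pad_k), (0, pad_n)))
        Kp, Np = w.shape
        return (w.reshape(Kp // k_block, k_block, Np // n_block, n_block)
                 .transpose(2, 0, 3, 1))
    ```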







[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-647582536


   Hi @FrozenGene , @anijain2305 ,
   Any update on this review? 
   Also, is there a way to retrigger the tests? Or should I contact someone in particular?
   
   Thanks





[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r442604112



##########
File path: src/relay/op/nn/convolution.h
##########
@@ -383,6 +383,38 @@ inline bool Conv2DWinogradWeightTransformRel(const Array<Type>& types, int num_i
   return true;
 }
 
+// Gemm convolution shape relations
+inline bool Conv2DGemmWeightTransformRel(const Array<Type>& types, int num_inputs,
+                                         const Attrs& attrs, const TypeReporter& reporter) {
+  CHECK_EQ(types.size(), 2);
+  const auto* data = types[0].as<TensorTypeNode>();

Review comment:
       Suggest renaming `data` to `weight`.

##########
File path: python/tvm/relay/qnn/op/legalizations.py
##########
@@ -237,17 +237,23 @@ def is_fast_int8_on_arm():
     target = tvm.target.Target.current(allow_none=False)
     return '+v8.2a,+dotprod' in ' '.join(target.options)
 
+def is_aarch64_arm():
+    """ Checks whether the hardware has support for fast Int8 arithmetic operations. """
+    target = tvm.target.Target.current(allow_none=False)
+    return 'aarch64' in ' '.join(target.options)
+
 ########################
 # ARM CPU legalizations.
 ########################
 
 @qnn_conv2d_legalize.register('arm_cpu')
 def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
     # ARM prefers the dtypes to be same.
-    if is_fast_int8_on_arm():
+    if is_aarch64_arm() and attrs["data_layout"] == "NHWC" or is_fast_int8_on_arm():

Review comment:
       Let us add `parentheses`. i.e. `if (is_aarch64_arm() and attrs["data_layout"] == "NHWC") or is_fast_int8_on_arm()`

##########
File path: topi/python/topi/arm_cpu/conv2d_gemm.py
##########
@@ -0,0 +1,176 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, unused-variable, too-many-locals
+# pylint: disable=unused-argument, redefined-builtin
+"""GEMM Convolution schedule on ARM"""
+import tvm
+from tvm import te
+from topi import nn
+from ..util import get_const_tuple
+from ..nn.util import get_pad_tuple
+from .tensor_intrin import gemv_quantized, gemv_quantized_impl
+
+
+# Compute function
+def compute_conv2d_gemm_without_weight_transform(cfg,
+                                                 data, B_interleaved_t, strides, padding, dilation,
+                                                 out_dtype, kernel_size, output_channels):
+    """Compute conv2d by transforming the input,
+    executing GEMM and transforming the output back"""
+    batches, IH, IW, IC = get_const_tuple(data.shape)
+
+    KH, KW = kernel_size
+    OC = output_channels
+
+    K_AREA = KH * KW
+
+    if isinstance(dilation, int):
+        dilation_h = dilation_w = dilation
+    else:
+        dilation_h, dilation_w = dilation
+
+    dilated_kernel_h = (KH - 1) * dilation_h + 1
+    dilated_kernel_w = (KW - 1) * dilation_w + 1
+
+    pad_top, pad_left, pad_down, pad_right = \
+        get_pad_tuple(padding, (dilated_kernel_h, dilated_kernel_w))
+    HSTR, WSTR = strides if isinstance(strides, (tuple, list)) else (strides, strides)
+
+    OH = (IH + pad_top + pad_down - dilated_kernel_h) // HSTR + 1
+    OW = (IW + pad_left + pad_right - dilated_kernel_w) // WSTR + 1
+    if pad_top or pad_left:
+        data_pad = nn.pad(data, [0, pad_top, pad_left, 0], [0, pad_down, pad_right, 0],
+                          name="data_pad")
+    else:
+        data_pad = data
+
+    # --- Im2col
+    M = OH * OW
+    K = IC * K_AREA
+    N = OC
+
+    A_shape = (batches, M, K)
+    if K_AREA == 1:
+        A = te.compute(A_shape, lambda n, x, y: data_pad[n, HSTR * (x // OW), WSTR * (x % OW), y],
+                       name='data_flatten')
+    else:
+        A = te.compute(A_shape, lambda n, x, y:
+                       data_pad[n,
+                                HSTR * (x // OW) + dilation_h * (y // IC) // KW,
+                                WSTR * (x % OW) + dilation_w * (y // IC) % KW, y % IC],
+                       name='data_im2col')
+    N_transformed = B_interleaved_t.shape[0]
+
+    # --- Pad if necessary
+    idxm = tvm.tir.indexmod
+
+    pad_m = 0
+    pad_k = 0
+
+    if M % 4 != 0:
+        pad_m = 4 - (M % 4)
+
+    if K % 16 != 0:
+        pad_k = 16 - (K % 16)
+
+    M_padded = M + pad_m
+    K_padded = K + pad_k
+
+    pad_before = (0, 0, 0)
+    pad_after = (0, pad_m, pad_k)
+
+    if pad_m != 0 or pad_k != 0:
+        A = nn.pad(A, pad_before=pad_before, pad_after=pad_after, name="A_padded")
+
+    # --- GEMM: A*B'
+    k = te.reduce_axis((0, K_padded), "k")
+
+    A_interleaved = te.compute((batches, M_padded // 4, K_padded // 16, 4, 16),
+                               lambda b, x, y, z, w: A[b, z + 4 * x, w + 16 * y],
+                               name='A_interleaved')
+
+    C_interleaved = te.compute((batches, M_padded // 4, N_transformed, 4, 4),
+                               lambda b, x, y, w, z:
+                               te.sum(A_interleaved[b, x, k//16, w, idxm(k, 16)].astype(out_dtype)*
+                                      B_interleaved_t[y, k//16, z, idxm(k, 16)].astype(out_dtype),
+                                      axis=k),
+                               name='C_interleaved')
+
+    # --- Unpack C
+    C = te.compute((batches, M, N),
+                   lambda b, x, y:
+                   C_interleaved[b, x // 4, y // 4, idxm(x, 4), idxm(y, 4)],
+                   name="C", tag='injective')
+
+    # --- Produce the conv output
+    out_shape = (batches, OH, OW, OC)
+    out = te.compute(out_shape, lambda b, x, y, z: C(b, y + OW * x, z),
+                     name='conv2d_gemm_output')
+
+    return out
+
+# Schedules
+
+

Review comment:
       unnecessary blank lines 
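
   As an aside for readers of the diff: a minimal NumPy sketch, with illustrative names, of the NHWC im2col step that `data_im2col` above implements (standard dilation semantics assumed; this is a reference re-statement, not the TVM code):

       import numpy as np

       def im2col_nhwc(data_pad, KH, KW, OH, OW, HSTR, WSTR, dil_h=1, dil_w=1):
           # data_pad: (N, H, W, IC) -> A: (N, OH*OW, KH*KW*IC); the
           # innermost index of the K axis runs over the input channels.
           N, _, _, IC = data_pad.shape
           A = np.empty((N, OH * OW, KH * KW * IC), data_pad.dtype)
           for x in range(OH * OW):
               oh, ow = divmod(x, OW)
               for y in range(KH * KW * IC):
                   k, ic = divmod(y, IC)
                   kh, kw = divmod(k, KW)
                   A[:, x, y] = data_pad[:, HSTR * oh + dil_h * kh,
                                         WSTR * ow + dil_w * kw, ic]
           return A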

##########
File path: python/tvm/relay/qnn/op/legalizations.py
##########
@@ -237,17 +237,23 @@ def is_fast_int8_on_arm():
     target = tvm.target.Target.current(allow_none=False)
     return '+v8.2a,+dotprod' in ' '.join(target.options)
 
+def is_aarch64_arm():
+    """ Checks whether the hardware has support for fast Int8 arithmetic operations. """

Review comment:
       The docstring is not correct: this function checks whether the target is AArch64, not whether it has fast int8 support.
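
   A possible correction, reusing the `target.options` check style of `is_fast_int8_on_arm` above (the exact substring test is an assumption, not the PR's code):

       def is_aarch64_arm():
           """ Checks whether the target is an AArch64 (64-bit Arm) platform. """
           target = tvm.target.Target.current(allow_none=False)
           return 'aarch64' in ' '.join(target.options)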

##########
File path: src/relay/op/nn/convolution.h
##########
@@ -383,6 +383,38 @@ inline bool Conv2DWinogradWeightTransformRel(const Array<Type>& types, int num_i
   return true;
 }
 
+// Gemm convolution shape relations
+inline bool Conv2DGemmWeightTransformRel(const Array<Type>& types, int num_inputs,

Review comment:
       Suggest adding a doc comment explaining the array-packing scheme and why the tiling factors are 4 and 16.
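
   For reference, a NumPy sketch of the packing this relation encodes: the (K, N) weight matrix is split into 4x16 tiles, where 4 is the number of output channels in a 4x4 output tile of the GEMM micro-kernel and 16 is the number of int8 lanes in a 128-bit NEON vector (names illustrative):

       import numpy as np

       def pack_weights(B):
           # B: (K, N), with K a multiple of 16 and N a multiple of 4.
           K, N = B.shape
           tiles = B.reshape(K // 16, 16, N // 4, 4)  # split both axes
           return tiles.transpose(2, 0, 3, 1)         # (N//4, K//16, 4, 16)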

##########
File path: src/relay/op/nn/convolution.h
##########
@@ -383,6 +383,38 @@ inline bool Conv2DWinogradWeightTransformRel(const Array<Type>& types, int num_i
   return true;
 }
 
+// Gemm convolution shape relations
+inline bool Conv2DGemmWeightTransformRel(const Array<Type>& types, int num_inputs,
+                                         const Attrs& attrs, const TypeReporter& reporter) {
+  CHECK_EQ(types.size(), 2);
+  const auto* data = types[0].as<TensorTypeNode>();
+  if (data == nullptr) return false;
+
+  CHECK_EQ(data->shape.size(), 4) << "Only support HWIO kernel layout";
+
+  const auto K = data->shape[0] * data->shape[1] * data->shape[2];
+  const auto N = data->shape[3];
+
+  auto k_mod_16 = indexmod(K, 16);
+  auto n_mod_4 = indexmod(N, 4);

Review comment:
       Could we make these tiling factors variables rather than hard-coded constants? We are creating a single op, and I don't think it should be restricted to Arm platforms; if someone wants to use it on an x86 CPU (e.g. with AVX-512), the numbers could be different.
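
   For illustration, the padded-shape computation with the factors exposed as parameters rather than hard-coded (a sketch of the suggestion, not existing code):

       def packed_weight_shape(KH, KW, IC, OC, tile_n=4, tile_k=16):
           K, N = KH * KW * IC, OC
           K += (tile_k - K % tile_k) % tile_k  # round K up to tile_k
           N += (tile_n - N % tile_n) % tile_n  # round N up to tile_n
           return (N // tile_n, K // tile_k, tile_n, tile_k)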




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-644025534


   @anijain2305, thanks for the review! About getting rid of the legalization: I would not do that for now. It is in my backlog to revisit this issue and try to retrieve the strategy from within the legalization pass, which should give us more optimization options. If that turns out not to be possible, then yes, I would remove the pass and do everything in the `alter_layout` pass.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440711660



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -2134,6 +2202,25 @@ def contrib_conv2d_winograd_weight_transform(weight,
     return _make.contrib_conv2d_winograd_weight_transform(weight, tile_size)
 
 
+def contrib_conv2d_gemm_weight_transform(weights):

Review comment:
       I added the assert in `conv2d_alter_op.py` as done for Winograd. Also, I am asserting the (input and kernel) layouts in [Conv2dGemmRel](https://github.com/apache/incubator-tvm/pull/5754/files#diff-d16c1ec50ceaac0e06f963a425d325b4R568), still following the approach used by Winograd. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642520722


   Hi @FrozenGene ,
   Thanks a lot for your comments.  I will address general replies here, and code comments in a separate reply.
   
   * I indeed read your discuss [post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4), but I thought the work was orthogonal to this one. My main goal here is to have a fast general convolution algorithm for Armv8-A. Your post talks about mobilenet v2, and raspi 3. 
   * In mobilenet v2 there are no deep convolutional layers, mostly depthwise convolutions and 1x1 convolutions. With shallow convolutions the problem becomes memory bound, and the differences among the algorithms  become less evident. That is also why I picked inception_v3, where there are 1x1, 3x3, 5x5, 1x7, 7x1 convolutions. 
   * Raspi 3 comes with a 32-bit operating system, which means using Armv7-A. The problem with Armv7-A is that instead of having 32 registers (as in Armv8-A) you have only 16, so the optimization space is reduced. Also, I think (but I am not 100% sure) that the TFLite developers do not optimize as aggressively for Armv7-A. Indeed, on Armv7-A @anijain2305 shows (in the same post you mention) a 0.80 tflite/tvm ratio, while I see 0.60/0.30 ratios for the multi/single-thread scenarios, respectively.
   * The Qnnpack post you mention explicitly says that: "the microkernel that leverages the dual issue capability proves to be 15 percent to 20 percent faster for a sufficiently large channel count (K > 64)"
   * The way they do convolution (and GEMM) in Qnnpack for Armv8-A is by using a combination of `smlal` and `smlal2` (plus a combination of `usubl` and `usubl2`), while `conv2d_nhwc_spatial_pack` only uses `smlal`. It is true that on Armv7-A they only use `vsmal` (and `vusubl`). So, I wonder whether the autoscheduler (which I am not familiar with) is able to generate such combinations for Armv8-A.
   * I did not try CPUs other than the Cortex-A76, but the point is that I am not using anything specific to that CPU, only features of the Armv8-A ISA.
   * I agree that for smaller convolutions (or depthwise convolutions) there are simpler algorithms that work as well (or even faster). I also agree with stacking multiple strategies and letting TVM select the best one.
   
   I will reply on the code in the following comment. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-646684376


   It actually seems related to: https://github.com/apache/incubator-tvm/issues/5827


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] anijain2305 commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r439561237



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -1976,6 +1976,74 @@ def contrib_conv2d_winograd_without_weight_transform(data,
         kernel_layout, out_layout, out_dtype)
 
 
+def contrib_conv2d_gemm_without_weight_transform(data,
+                                                 weight,
+                                                 strides=(1, 1),
+                                                 padding=(0, 0),
+                                                 dilation=(1, 1),
+                                                 groups=1,
+                                                 channels=None,
+                                                 kernel_size=None,
+                                                 data_layout="NCHW",
+                                                 kernel_layout="OIHW",
+                                                 out_layout="",
+                                                 out_dtype=""):
+    r"""2D convolution with gemm algorithm.

Review comment:
       Is the `r` (raw string) prefix necessary here?
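
   For context, the `r` prefix only matters when the docstring contains backslashes (e.g. Sphinx math markup), as in this toy example:

       def scale(x):
           r"""Return :math:`y = \alpha x`; the raw prefix keeps ``\alpha`` verbatim."""
           return x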

##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -2134,6 +2202,25 @@ def contrib_conv2d_winograd_weight_transform(weight,
     return _make.contrib_conv2d_winograd_weight_transform(weight, tile_size)
 
 
+def contrib_conv2d_gemm_weight_transform(weights):

Review comment:
       Does this need a kernel layout argument?

##########
File path: python/tvm/relay/op/nn/_nn.py
##########
@@ -421,6 +421,24 @@ def compute_mirror_pad(attrs, inputs, out_dtype):
 reg.register_pattern("nn.contrib_conv2d_winograd_without_weight_transform",
                      OpPattern.OUT_ELEMWISE_FUSABLE)
 
+# conv2d_gemm related operators
+reg.register_strategy("nn.contrib_conv2d_gemm_without_weight_transform",
+                      strategy.conv2d_gemm_without_weight_transform_strategy)
+reg.register_pattern("nn.contrib_conv2d_gemm_without_weight_transform",
+                     OpPattern.OUT_ELEMWISE_FUSABLE)
+
+
+@reg.register_compute("nn.contrib_conv2d_gemm_weight_transform")
+def compute_contrib_conv2d_gemm_weight_transform(attrs, inputs, out_dtype):
+    """Compute definition of contrib_conv2d_gemm_weight_transform"""
+    out = topi.nn.conv2d_gemm_weight_transform(
+        inputs[0])

Review comment:
       Can we move this to the previous line?

##########
File path: topi/python/topi/arm_cpu/conv2d_alter_op.py
##########
@@ -235,5 +239,37 @@ def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):
              new_attrs['out_layout'], out_dtype], topi_tmpl)
         dispatch_ctx.update(target, new_workload, cfg)
         return relay.nn.contrib_depthwise_conv2d_nchwc(*inputs, **new_attrs)
+    if topi_tmpl == "conv2d_NHWC_quantized.arm_cpu":
+        assert (data.dtype == 'int8' and kernel.dtype == 'int8' or
+                data.dtype == 'uint8' and kernel.dtype == 'uint8')
+        CO, IC, KH, KW = get_const_tuple(kernel.shape)
+
+        K = KH * KW * IC
+        N = CO
+
+        pad_k = 0
+        pad_n = 0
+
+        if N % 4 != 0:
+            pad_n = 4 - (N % 4)
+
+        if K % 16 != 0:
+            pad_k = 16 - (K % 16)
+
+        N_padded = N + pad_n
+        K_padded = K + pad_k
+
+        kernel_expr = relay.nn.contrib_conv2d_gemm_weight_transform(inputs[1])

Review comment:
       I was wondering whether it is possible to represent the weight transform as a sequence of existing Relay ops. In that case we would not need a new contrib op: we could put that sequence here, and FoldConstant would optimize it away.
   
   If not, do you think we need to pass kernel layout information? Also, should we name it along the lines of `contrib_conv2d_gemm_hwio_weight_transform`?
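   
   A sketch of what such a sequence could look like, assuming the packing is pure reshape/pad/transpose (the helper name is illustrative and this is untested against the PR):
   
       from tvm import relay
   
       def gemm_weight_transform_via_relay(weights, KH, KW, IC, OC):
           # Flatten the HWIO kernel to a (K, N) matrix, K = KH*KW*IC, N = OC.
           K, N = KH * KW * IC, OC
           pad_k = (16 - K % 16) % 16
           pad_n = (4 - N % 4) % 4
           w = relay.reshape(weights, (K, N))
           if pad_k or pad_n:
               w = relay.nn.pad(w, ((0, pad_k), (0, pad_n)))
           # Tile and interleave: (K, N) -> (N//4, K//16, 4, 16).
           w = relay.reshape(w, ((K + pad_k) // 16, 16, (N + pad_n) // 4, 4))
           return relay.transpose(w, (2, 0, 3, 1))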




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] anijain2305 commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440384299



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -1976,6 +1976,74 @@ def contrib_conv2d_winograd_without_weight_transform(data,
         kernel_layout, out_layout, out_dtype)
 
 
+def contrib_conv2d_gemm_without_weight_transform(data,
+                                                 weight,
+                                                 strides=(1, 1),
+                                                 padding=(0, 0),
+                                                 dilation=(1, 1),
+                                                 groups=1,
+                                                 channels=None,
+                                                 kernel_size=None,
+                                                 data_layout="NCHW",
+                                                 kernel_layout="OIHW",
+                                                 out_layout="",
+                                                 out_dtype=""):
+    r"""2D convolution with gemm algorithm.

Review comment:
       Thanks for the explanation. No need to remove it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#discussion_r440237827



##########
File path: python/tvm/relay/op/nn/nn.py
##########
@@ -2134,6 +2202,25 @@ def contrib_conv2d_winograd_weight_transform(weight,
     return _make.contrib_conv2d_winograd_weight_transform(weight, tile_size)
 
 
+def contrib_conv2d_gemm_weight_transform(weights):

Review comment:
       Let us add an assert here, the same as is done for Winograd.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] FrozenGene commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642651252


   > So I mean to add a `convert_data_type` pass that is similar to `alter_op_layout` but converts datatype (and we can do something like `if topi_impl == 'spatial_nhwc' converts to int16`.
   
   I think this would be an interesting pass: just as we have `_alter_op_layout` with different logic for different strategies, we could have an `_alter_op_dtype` pass with strategy-specific logic.
   
   However, such a pass would do much the same thing as legalization (changing dtypes), so ideally our legalization pass should handle this work according to the chosen strategy.
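   
   A hypothetical sketch of the idea (neither this hook nor the `strategy_name` argument exists in TVM today; the names are invented for illustration):
   
       from tvm import relay
   
       def alter_conv2d_dtype(strategy_name, data, kernel, new_attrs):
           if strategy_name == "conv2d_nhwc_spatial_pack.arm_cpu":
               # Widen int8 -> int16 so the schedule can use 16-bit lanes.
               data = relay.cast(data, "int16")
               kernel = relay.cast(kernel, "int16")
           return relay.nn.conv2d(data, kernel, **new_attrs)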


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros commented on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642801815


   Hi @FrozenGene ,
   I gave it another go, but switching the legalization on the strategy seems very hard (since we would need the auto-tuner to pick the best data type for us).
   
   So for now we have to make do with the `_alter_conv2d_layout` workaround and think a bit more about how we can infer the strategy during legalization.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


