Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/07/22 22:46:45 UTC

[GitHub] [incubator-tvm] giuseros opened a new pull request #6117: Use auto-tuner to improve conv2d_gemm performance

giuseros opened a new pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117


   ## High level description of this contribution
   The following tuning entities have been introduced:
   - Unrolling and vectorizing input matrix transform
   - Reordering gemm to exploit parallel threads
   - Unrolling `gemm_quantized` intrinsic
   - Interleaving `gemm_quantized` intrinsic
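
   The four entities above become knobs in an AutoTVM config space, and the tuner searches over their combinations. As a rough pure-Python sketch of the search space such knobs define (the knob names here are hypothetical, not the ones used in the PR):

   ```python
   from itertools import product

   # Hypothetical boolean knobs mirroring the four tuning entities above:
   # unroll/vectorize the input transform, reorder the gemm loops, and
   # unroll/interleave the gemm_quantized intrinsic.
   KNOBS = {
       "transform_unroll":    [False, True],
       "transform_vectorize": [False, True],
       "gemm_reorder":        [False, True],
       "intrin_unroll":       [False, True],
       "intrin_interleave":   [False, True],
   }

   def config_space(knobs):
       """Enumerate every knob combination, as a grid search over the space would."""
       names = list(knobs)
       for values in product(*(knobs[n] for n in names)):
           yield dict(zip(names, values))

   space = list(config_space(KNOBS))
   print(len(space))  # 2**5 = 32 candidate configurations
   ```

   Each candidate configuration is then compiled and timed on the target device, and the best-performing one is recorded.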
   
   Main files touched:
   * `topi/python/topi/arm_cpu/tensor_intrin.py`
   * `topi/python/topi/arm_cpu/conv2d_gemm.py`
   
   ## RFC
   The RFC for this submission is available [here](https://discuss.tvm.ai/t/rfc-use-auto-tuner-to-improve-conv2d-gemm-performance/7392)
   
   Change-Id: Icd3ab005663f78a80672e71ef368f6d0efa4a401
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] giuseros commented on a change in pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#discussion_r475669627



##########
File path: python/tvm/topi/arm_cpu/conv2d_int8.py
##########
@@ -142,6 +142,7 @@ def schedule_conv2d_NHWC_quantized(cfg, outs):
     n, h, w, c = out.op.axis
     outer, inner = s[out].split(c, 4)
     s[out].vectorize(inner)
+    s[out].parallel(h)

Review comment:
       I also fused the batch and the first outer dimension in all the `conv2d_gemm` schedules







[GitHub] [incubator-tvm] FrozenGene commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-676508295


   I can't spare the time to review this today. Will do it tomorrow.





[GitHub] [incubator-tvm] FrozenGene merged pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene merged pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117


   





[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#discussion_r477065824



##########
File path: python/tvm/topi/arm_cpu/tensor_intrin.py
##########
@@ -21,7 +21,186 @@
 from tvm import te
 from tvm.contrib import util, clang
 
-def gemv_quantized_impl(M, N, data_type='uint8'):
+def gemm_quantized_4_4_batched():
+    return """
+           // First half
+           // Lower part of a0 * {b0,b1,b2,b3}
+           "umull v8.8h, v0.8b, v4.8b\\n"
+           "umull v9.8h, v0.8b, v5.8b\\n"
+           "umull v10.8h, v0.8b, v6.8b\\n"
+           "umull v11.8h, v0.8b, v7.8b\\n"
+
+           // Lower part of a1 * {b0,b1,b2,b3}
+           "umull v12.8h, v1.8b, v4.8b\\n"
+           "umull v13.8h, v1.8b, v5.8b\\n"
+           "umull v14.8h, v1.8b, v6.8b\\n"
+           "umull v15.8h, v1.8b, v7.8b\\n"
+
+           // Accumulate
+           "uadalp v16.4s, v8.8h\\n"
+           "uadalp v17.4s, v9.8h\\n"
+           "uadalp v18.4s, v10.8h\\n"
+           "uadalp v19.4s, v11.8h\\n"
+           "uadalp v20.4s, v12.8h\\n"
+           "uadalp v21.4s, v13.8h\\n"
+           "uadalp v22.4s, v14.8h\\n"
+           "uadalp v23.4s, v15.8h\\n"
+
+           // Higher part of a0 * {b0,b1,b2,b3}
+           "umull2 v8.8h, v0.16b, v4.16b\\n"
+           "umull2 v9.8h, v0.16b, v5.16b\\n"
+           "umull2 v10.8h, v0.16b, v6.16b\\n"
+           "umull2 v11.8h, v0.16b, v7.16b\\n"
+
+           // Higher part of a1 * {b0,b1,b2,b3}
+           "umull2 v12.8h, v1.16b, v4.16b\\n"
+           "umull2 v13.8h, v1.16b, v5.16b\\n"
+           "umull2 v14.8h, v1.16b, v6.16b\\n"
+           "umull2 v15.8h, v1.16b, v7.16b\\n"
+
+           // Accumulate again
+           "uadalp v16.4s, v8.8h\\n"
+           "uadalp v17.4s, v9.8h\\n"
+           "uadalp v18.4s, v10.8h\\n"
+           "uadalp v19.4s, v11.8h\\n"
+           "uadalp v20.4s, v12.8h\\n"
+           "uadalp v21.4s, v13.8h\\n"
+           "uadalp v22.4s, v14.8h\\n"
+           "uadalp v23.4s, v15.8h\\n"
+
+           // Second half
+           // Lower part of a2 * {b0,b1,b2,b3}
+           "umull v8.8h, v2.8b, v4.8b\\n"
+           "umull v9.8h, v2.8b, v5.8b\\n"
+           "umull v10.8h, v2.8b, v6.8b\\n"
+           "umull v11.8h, v2.8b, v7.8b\\n"
+
+           // Lower part of a3 * {b0,b1,b2,b3}
+           "umull v12.8h, v3.8b, v4.8b\\n"
+           "umull v13.8h, v3.8b, v5.8b\\n"
+           "umull v14.8h, v3.8b, v6.8b\\n"
+           "umull v15.8h, v3.8b, v7.8b\\n"
+
+           // Accumulate
+           "uadalp v24.4s, v8.8h\\n"
+           "uadalp v25.4s, v9.8h\\n"
+           "uadalp v26.4s, v10.8h\\n"
+           "uadalp v27.4s, v11.8h\\n"
+           "uadalp v28.4s, v12.8h\\n"
+           "uadalp v29.4s, v13.8h\\n"
+           "uadalp v30.4s, v14.8h\\n"
+           "uadalp v31.4s, v15.8h\\n"
+
+           // Higher part of a2 * {b0,b1,b2,b3}
+           "umull2 v8.8h, v2.16b, v4.16b\\n"
+           "umull2 v9.8h, v2.16b, v5.16b\\n"
+           "umull2 v10.8h, v2.16b, v6.16b\\n"
+           "umull2 v11.8h, v2.16b, v7.16b\\n"
+
+           // Higher part of a3 * {b0,b1,b2,b3}
+           "umull2 v12.8h, v3.16b, v4.16b\\n"
+           "umull2 v13.8h, v3.16b, v5.16b\\n"
+           "umull2 v14.8h, v3.16b, v6.16b\\n"
+           "umull2 v15.8h, v3.16b, v7.16b\\n"
+
+           // Accumulate again
+           "uadalp v24.4s, v8.8h\\n"
+           "uadalp v25.4s, v9.8h\\n"
+           "uadalp v26.4s, v10.8h\\n"
+           "uadalp v27.4s, v11.8h\\n"
+           "uadalp v28.4s, v12.8h\\n"
+           "uadalp v29.4s, v13.8h\\n"
+           "uadalp v30.4s, v14.8h\\n"
+           "uadalp v31.4s, v15.8h\\n"
+    """
+
+def gemm_quantized_4_4_interleaved():

Review comment:
       Ignore this. I found it: this is guarded by the `is_aarch64_arm` check







[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#discussion_r474606293



##########
File path: python/tvm/topi/arm_cpu/conv2d_int8.py
##########
@@ -142,6 +142,7 @@ def schedule_conv2d_NHWC_quantized(cfg, outs):
     n, h, w, c = out.op.axis
     outer, inner = s[out].split(c, 4)
     s[out].vectorize(inner)
+    s[out].parallel(h)

Review comment:
       Let us fuse `n` and `h` to support possible multiple batches
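
       The suggested fusion can be modeled in plain Python: the fused axis has length `n * h`, and each fused index maps back to a `(batch, row)` pair. This is a conceptual sketch of what `s[out].fuse(n, h)` followed by `parallel` achieves, not the actual TVM schedule code:

       ```python
       def fuse_and_split(n, h):
           """Model fusing a batch axis of length n with a row axis of length h:
           one axis of length n*h, where fused index f maps back to
           (batch, row) = (f // h, f % h). Parallelizing the fused axis keeps
           all cores busy even when n > 1."""
           return [(f // h, f % h) for f in range(n * h)]

       # With n=2 batches and h=3 rows, the fused axis covers all 6 (batch, row) pairs.
       print(fuse_and_split(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
       ```

       Parallelizing only `h`, by contrast, would leave the batch loop serial.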







[GitHub] [incubator-tvm] FrozenGene commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-680688614


   Thanks @giuseros @anijain2305 It is merged now.





[GitHub] [incubator-tvm] giuseros commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-662737470


   cc @FrozenGene @anijain2305 @u99127 
   
   Please note that since I will be off from Friday (for 15 days), I might turn this into a draft and pick it up when I come back. Since it is self-contained, that should not be an issue.





[GitHub] [incubator-tvm] giuseros commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-672767207


   Hi @anijain2305 , I just got back from holidays and this is ready for review!





[GitHub] [incubator-tvm] giuseros commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-675597084


   Hi @FrozenGene , @anijain2305 , 
   Any update on this? 
   
   Thanks a lot,
   Giuseppe





[GitHub] [incubator-tvm] FrozenGene commented on a change in pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on a change in pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#discussion_r477064579



##########
File path: python/tvm/topi/arm_cpu/tensor_intrin.py
##########
@@ -21,7 +21,186 @@
 from tvm import te
 from tvm.contrib import util, clang
 
-def gemv_quantized_impl(M, N, data_type='uint8'):
+def gemm_quantized_4_4_batched():
+    return """
+           // First half
+           // Lower part of a0 * {b0,b1,b2,b3}
+           "umull v8.8h, v0.8b, v4.8b\\n"
+           "umull v9.8h, v0.8b, v5.8b\\n"
+           "umull v10.8h, v0.8b, v6.8b\\n"
+           "umull v11.8h, v0.8b, v7.8b\\n"
+
+           // Lower part of a1 * {b0,b1,b2,b3}
+           "umull v12.8h, v1.8b, v4.8b\\n"
+           "umull v13.8h, v1.8b, v5.8b\\n"
+           "umull v14.8h, v1.8b, v6.8b\\n"
+           "umull v15.8h, v1.8b, v7.8b\\n"
+
+           // Accumulate
+           "uadalp v16.4s, v8.8h\\n"
+           "uadalp v17.4s, v9.8h\\n"
+           "uadalp v18.4s, v10.8h\\n"
+           "uadalp v19.4s, v11.8h\\n"
+           "uadalp v20.4s, v12.8h\\n"
+           "uadalp v21.4s, v13.8h\\n"
+           "uadalp v22.4s, v14.8h\\n"
+           "uadalp v23.4s, v15.8h\\n"
+
+           // Higher part of a0 * {b0,b1,b2,b3}
+           "umull2 v8.8h, v0.16b, v4.16b\\n"
+           "umull2 v9.8h, v0.16b, v5.16b\\n"
+           "umull2 v10.8h, v0.16b, v6.16b\\n"
+           "umull2 v11.8h, v0.16b, v7.16b\\n"
+
+           // Higher part of a1 * {b0,b1,b2,b3}
+           "umull2 v12.8h, v1.16b, v4.16b\\n"
+           "umull2 v13.8h, v1.16b, v5.16b\\n"
+           "umull2 v14.8h, v1.16b, v6.16b\\n"
+           "umull2 v15.8h, v1.16b, v7.16b\\n"
+
+           // Accumulate again
+           "uadalp v16.4s, v8.8h\\n"
+           "uadalp v17.4s, v9.8h\\n"
+           "uadalp v18.4s, v10.8h\\n"
+           "uadalp v19.4s, v11.8h\\n"
+           "uadalp v20.4s, v12.8h\\n"
+           "uadalp v21.4s, v13.8h\\n"
+           "uadalp v22.4s, v14.8h\\n"
+           "uadalp v23.4s, v15.8h\\n"
+
+           // Second half
+           // Lower part of a2 * {b0,b1,b2,b3}
+           "umull v8.8h, v2.8b, v4.8b\\n"
+           "umull v9.8h, v2.8b, v5.8b\\n"
+           "umull v10.8h, v2.8b, v6.8b\\n"
+           "umull v11.8h, v2.8b, v7.8b\\n"
+
+           // Lower part of a3 * {b0,b1,b2,b3}
+           "umull v12.8h, v3.8b, v4.8b\\n"
+           "umull v13.8h, v3.8b, v5.8b\\n"
+           "umull v14.8h, v3.8b, v6.8b\\n"
+           "umull v15.8h, v3.8b, v7.8b\\n"
+
+           // Accumulate
+           "uadalp v24.4s, v8.8h\\n"
+           "uadalp v25.4s, v9.8h\\n"
+           "uadalp v26.4s, v10.8h\\n"
+           "uadalp v27.4s, v11.8h\\n"
+           "uadalp v28.4s, v12.8h\\n"
+           "uadalp v29.4s, v13.8h\\n"
+           "uadalp v30.4s, v14.8h\\n"
+           "uadalp v31.4s, v15.8h\\n"
+
+           // Higher part of a2 * {b0,b1,b2,b3}
+           "umull2 v8.8h, v2.16b, v4.16b\\n"
+           "umull2 v9.8h, v2.16b, v5.16b\\n"
+           "umull2 v10.8h, v2.16b, v6.16b\\n"
+           "umull2 v11.8h, v2.16b, v7.16b\\n"
+
+           // Higher part of a3 * {b0,b1,b2,b3}
+           "umull2 v12.8h, v3.16b, v4.16b\\n"
+           "umull2 v13.8h, v3.16b, v5.16b\\n"
+           "umull2 v14.8h, v3.16b, v6.16b\\n"
+           "umull2 v15.8h, v3.16b, v7.16b\\n"
+
+           // Accumulate again
+           "uadalp v24.4s, v8.8h\\n"
+           "uadalp v25.4s, v9.8h\\n"
+           "uadalp v26.4s, v10.8h\\n"
+           "uadalp v27.4s, v11.8h\\n"
+           "uadalp v28.4s, v12.8h\\n"
+           "uadalp v29.4s, v13.8h\\n"
+           "uadalp v30.4s, v14.8h\\n"
+           "uadalp v31.4s, v15.8h\\n"
+    """
+
+def gemm_quantized_4_4_interleaved():

Review comment:
       One last question: do we have a target guard to make sure this only takes effect on AArch64? For example, the `umull` form used here doesn't exist on Arm32.
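
       For reference, the `umull`/`uadalp` pattern in the diff above can be modeled in scalar Python: `umull` widens two 8-lane uint8 vectors into 16-bit products, and `uadalp` pairwise-adds adjacent 16-bit lanes into 32-bit accumulators. This is a behavioral sketch of the instruction semantics, not TVM code:

       ```python
       def umull(a, b):
           """Widening multiply: 8 uint8 lanes -> 8 uint16 products
           (models `umull vD.8h, vN.8b, vM.8b`)."""
           assert len(a) == len(b) == 8
           return [(x * y) & 0xFFFF for x, y in zip(a, b)]

       def uadalp(acc, v):
           """Unsigned add and accumulate long pairwise: each 32-bit accumulator
           lane receives the sum of two adjacent 16-bit lanes
           (models `uadalp vD.4s, vN.8h`)."""
           assert len(v) == 2 * len(acc)
           return [(acc[i] + v[2 * i] + v[2 * i + 1]) & 0xFFFFFFFF
                   for i in range(len(acc))]

       a = [1, 2, 3, 4, 5, 6, 7, 8]
       b = [10] * 8
       acc = uadalp([0, 0, 0, 0], umull(a, b))
       print(acc)  # [30, 70, 110, 150]
       ```

       The widening to 16 bits and the pairwise accumulation into 32 bits are what keep the uint8 dot products from overflowing, which is why the guard question above matters: on Arm32 the equivalent NEON spellings differ.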







[GitHub] [incubator-tvm] giuseros commented on pull request #6117: Use auto-tuner to improve conv2d_gemm performance

Posted by GitBox <gi...@apache.org>.
giuseros commented on pull request #6117:
URL: https://github.com/apache/incubator-tvm/pull/6117#issuecomment-680226417


   Hi @FrozenGene , 
   Any update on this?
   
   Thanks again,
   Giuseppe

