Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/09/28 11:23:23 UTC

[GitHub] [tvm-rfcs] ekalda opened a new pull request, #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

ekalda opened a new pull request, #94:
URL: https://github.com/apache/tvm-rfcs/pull/94

   This RFC is to add a CodeGenAArch64 backend with SVE.




[GitHub] [tvm-rfcs] tqchen commented on pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
tqchen commented on PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#issuecomment-1264656688

   Thanks @ekalda. It is great to see us having conversations on bringing in SVE. The main question we likely need to resolve is **what the TIR spec that goes into codegen and carries the SVE info should look like**.
   
   Three alternatives have been discussed so far:
   
   ### A0: Loop with annotation but body as scalar
   
   ```python
     for (i: int32, 0, 20;i, annotation={"VLA"}) {
       C_2[i] = A_2[i] + B_2[i];
     }
   ```
   ### A1: Vectorized loop with constant vector factor 
   
   ```python
     for (i: int32, 0, 20; i) {
       C_2[ramp(i, 0, 5)] = A_2[ramp(i, 0, 5)] + B_2[ramp(i, 0, 5)];
     }
   ```
   
   ### A2: Vectorized loop with some form of TIR repr for the SVE vector
   
   ```python
     for (i: int32, 0, 20; i) {
       C_2[ramp(i, 0, vscale)] = A_2[ramp(i, 0, vscale)] + B_2[ramp(i, 0, vscale)];
     }
   ```
   
   This would involve updates to the Ramp node in TIR. See the `kScalableVectorLaneMark` comment in the [previous discussion](https://github.com/apache/tvm-rfcs/pull/18).
   
   ## Discussion
   The above three alternatives set the stage for discussion. This RFC proposes A1.
   
   Because it is a proposed change to codegen only, which does not change TIR, if A1 can be implemented correctly then I think it is a positive step (close to the S0-type change we had in other conversations), even if we want to do things in several stages (with follow-up S1 changes).
   
   The main question for discussion is how we can implement A1 robustly.
   
   Turning specialized code into a general one is a bit like raising (from a special case to the general one), so it would be good to add a high-level description of the pattern match and conversion rules. For some background, I initially thought that there might be some traps when the code contains specializations to the lane count, but after thinking a bit more I find that my initial counterexample is actually fine under A1. So I am more convinced of this approach.
   
   
   Something around the following:
   
   We would only apply the SVE specialization if the code satisfies the following pattern (a sketch of such a check follows the list):
   
   - Pattern match all ramped loads/stores `A[ramp(iter*lanes, 0, lanes)]` to ensure they have the same lanes; change the lane count to VL with predication.
   - Change the outer loop iterator to a vector loop.
   - If there is a vector load/store that does not satisfy the pattern, we abort.
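   
   A minimal sketch of the first and last rules, assuming the stock `tvm.tir.stmt_functor` visitor (this is only an illustration of the check, not something from the RFC):
   
   ```python
   import tvm
   from tvm import tir


   def sve_raising_candidate(loop: tir.For):
       """Return the common lane count if all ramps in the loop body agree, else None."""
       lanes = set()

       def visit(node):
           if isinstance(node, tir.Ramp):
               lanes.add(node.lanes)

       tir.stmt_functor.post_order_visit(loop.body, visit)
       # Exactly one lane count across all ramped accesses -> candidate for the
       # SVE rewrite; any mismatch -> abort and keep the fixed-width codegen.
       return lanes.pop() if len(lanes) == 1 else None
   ```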
   
   
   
   
   
   
   
   
   
   




[GitHub] [tvm-rfcs] leandron merged pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
leandron merged PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94




[GitHub] [tvm-rfcs] leandron commented on pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
leandron commented on PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#issuecomment-1279105928

   Thanks @tqchen @ekalda. This has been up for a few days and has received no new questions, so I'm merging it, and we'll continue with the work towards what's described in the RFC.




[GitHub] [tvm-rfcs] tqchen commented on pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
tqchen commented on PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#issuecomment-1276158707

   Thanks @ekalda, I don't have further comments at this point.




[GitHub] [tvm-rfcs] tqchen commented on a diff in pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
tqchen commented on code in PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#discussion_r985240243


##########
rfcs/0094-aarch64-backend-with-sve.md:
##########
@@ -0,0 +1,140 @@
+- Feature Name: aarch64_backend
+- Start Date: 2022-09-26
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@manupak](https://github.com/manupak), [@u99127](https://github.com/u99127)
+
+# Summary
+
+This RFC proposes a new TIR backend for AArch64 codegen to support target-specific features, specifically SVE. Currently, AArch64-specific code is generated either through the generic LLVM backend or through tensorize implementations (e.g. the MMLA Arm(R) Neon(TM) instruction), but we could see a benefit from having more fine-grained control over the LLVM codegen that targets AArch64.
+
+# Motivation
+
+The main motivation behind this work is to introduce SVE instructions in codegen without changing IRs, scheduling primitives or TVM passes. An AArch64 backend would be a good place to work around the issues in LLVM SVE code generation that have surfaced while adding support for SVE in Halide. In addition, the `CodegenAArch64` backend would not be limited to SVE codegen – it could be used to introduce AArch64-specific lowering where required, either for specialised use of AArch64 intrinsics or to work around limitations of LLVM.
+
+# Guide-level explanation
+
+In comparison to the Arm(R) Neon(TM) instruction set, which uses a fixed vector length, SVE allows the developer to write vectorized code where the exact length of a vector is unknown at compile time. That code can then run on hardware implementations with different choices of vector length. The only constraint on the vector length is that it has to be a minimum of 128 bits and a multiple of 128 bits. *Vscale* is the number of 128-bit chunks that fit into the SVE vector, e.g. a vscale of 4 results in a vector length of 512 bits.
+
+The initial SVE implementation in TVM would focus on two main capabilities of SVE:
+
+**1. Vector length agnostic loops**
+As an example, consider this vector length agnostic loop that adds two vectors with FP32 elements:
+
+```
+for (i = 0; i < n; i += 4 * vscale)
+    c[i : i + 4 * vscale] = a[i : i + 4 * vscale] + b[i : i + 4 * vscale]
+```
+
+The number 4 in the above example comes from the fact that four FP32 elements fit into 128 bits. The number of times we have to run the loop depends on vscale, which is a hardware implementation detail. If the vector length were, for example, 256 bits, we could process 8 FP32 elements in one iteration, meaning we would have to do `n / 8` iterations. With a vector length of 512 bits, we would only need `n / 16` iterations.
+
+**2. Predication**
+SVE provides support for predication, enabling us to deal efficiently with loop tails, among other things. In the example above, `n` may or may not be a multiple of `4 * vscale`. Predication allows us to handle this loop without any special consideration for the remainder of the elements, i.e. `c[n - n % (4 * vscale) : n]`. Essentially, every operation on SVE registers takes a predicate register as one of its arguments, which acts as a bit mask indicating which elements are active. Similarly to the vector length, the length of a predicate depends on the hardware implementation.
+
+```
+whilelt p0.s, w17, w12
+ld1w    { z0.s }, p0/z, [x2, x17, lsl #2]
+ld1w    { z1.s }, p0/z, [x1, x17, lsl #2]
+fadd    z2.s , z0.s , z1.s
+st1w    { z2.s }, p0, [x0, x17, lsl #2]
+```
+
+In that example, `whilelt` constructs the predicate register `p0` based on the loop bound variable and the increment variable stored in `w` registers.
+
+## How to target AArch64 backend
+
+Similarly to how we target other LLVM codegen backends, we would invoke the AArch64 backend by parsing the `-mtriple` in the target string:
+
+```
+target = "llvm -mtriple=aarch64-gnu-linux -mattr=+sve"
+```
+
+The node visitors in the AArch64 backend implementation would generate SVE code when `+sve` is part of the `-mattr`.
+
+# Reference-level explanation
+
+The main difference compared to CodegenLLVM would be how we generate LLVM IR and assembly for `Ramp` and `Broadcast` nodes.
+
+Let's take a simple vectorized addition of two dimensional tensors as an example:
+
+```
+A = te.placeholder((200, 200), name="A")
+B = te.placeholder((200, 200), name="B")
+T = te.compute((200, 200), lambda i, j: A[i, j] + B[i, j])
+
+s = te.create_schedule(T.op)
+xo, yo, xi, yi = s[T].tile(T.op.axis[0], T.op.axis[1], x_factor=10, y_factor=5)
+                                                                    # ^^ this would be the vector length
+s[T].vectorize(yi)
+```
+
+Currently, loops that are annotated with vectorize will be represented as `Ramp` nodes in TIR:
+
+```
+@main = primfn(A_1: handle, B_1: handle, m: int32, n: int32) -> ()
+  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
+  buffers = {A: Buffer(A_2: Pointer(float32), float32, [200, 200], []),
+             B: Buffer(B_2: Pointer(float32), float32, [200, 200], [])}
+  buffer_map = {A_1: A, B_1: B} {
+  realize(compute: Buffer(compute_1: Pointer(float32), float32, [200, 200], []), [0:200, 0:200], True {
+    for (i.outer: int32, 0, 20) {
+      for (j.outer: int32, 0, 40) {
+        for (i.inner: int32, 0, 10) "unroll" {
+          compute[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] = (A[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] + B[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)])
+        }
+      }
+    }
+  })
+}
+```
+
+The above TIR segment contains static numbers for the lane count (5) and the inferred bound (40) across the `j` axis. If SVE is used, the AArch64 backend would treat the lane count as `llvm.vscale() * 4` and the corresponding loop bound as `ceil((40 * 5) / (llvm.vscale() * 4))`, i.e. the 200 elements along the `j` axis divided by the scalable vector length.
+
+With SVE enabled, this TIR would further be lowered to LLVM:
+
+```

Review Comment:
   Based on this description, it seems the proposed approach is that:
   - we pattern match a fixed vectorization (lane=5)
   - raise it to an SVE pattern (with vscale and lane != 5)
   - codegen
   
   One concern is that the code can be simplified using the assumption (lane=5) during the lowering phase, but that simplification may not hold for the general case.
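   
   As a hypothetical illustration of the concern (not an example from the RFC): with a fixed lane count the lowering can prove there is no loop tail, e.g.
   
   ```python
     for (j: int32, 0, 40) {
       C_2[ramp(j*5, 1, 5)] = A_2[ramp(j*5, 1, 5)];  # 40 * 5 == 200, so no tail handling needed
     }
   ```
   
   but once lane 5 is raised to `4 * vscale`, that divisibility argument no longer holds, so predication (or a tail loop) has to be reintroduced rather than relying on the simplified form.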
   
   Edit: After thinking a bit more, I now think the above concern can be addressed by clarifying a strict set of raising rules, so feel free to ignore this.



##########
rfcs/0094-aarch64-backend-with-sve.md:
##########

Review Comment:
   As an alternative, would it be possible to directly generate from a non-vectorized spec? The question is: if we are already in a loop with the VLA annotation, presumably the cost of pattern matching is similar?
   
   ```c++
     for (i: int32, 0, 17;i, annotation={"VLA"}) {
       C_2[i] = A_2[i] + B_2[i];
     }
   ```
   We would then be deferring the vectorized instruction generation to the codegen phase by specially handling the patterns in the `for` loop that is annotated as a VLA loop. Of course, we can only support a limited set of patterns (such as reads/writes to the same vector index or limited reduction support), which is why a legalize pass is needed to make sure the body of the VLA for loop satisfies the pattern (a sketch of such a check is below).
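   
   A minimal sketch of that kind of legalization check, assuming the stock `tvm.tir.stmt_functor` visitor (only an illustration under that assumption, not a proposed implementation):
   
   ```python
   import tvm
   from tvm import tir


   def is_legal_vla_body(loop: tir.For) -> bool:
       """Accept only bodies whose loads/stores index directly with the loop variable."""
       legal = [True]

       def visit(node):
           if isinstance(node, (tir.BufferLoad, tir.BufferStore)):
               indices = list(node.indices)
               # Only the simple C[i] = A[i] + B[i] style pattern is accepted here.
               if len(indices) != 1 or not indices[0].same_as(loop.loop_var):
                   legal[0] = False

       tir.stmt_functor.post_order_visit(loop.body, visit)
       return legal[0]
   ```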
   





[GitHub] [tvm-rfcs] ekalda commented on pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
ekalda commented on PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#issuecomment-1275917969

   Thanks for your input and suggestions @tqchen, much appreciated! I added a paragraph about pattern matching the TIR; see if it makes sense.
   
   Yes, this RFC proposes the A1 change. An A2-style TIR intrinsic is in the plan further down the line; it would let us expose SVE capabilities to the core compiler, so we could explore a larger space of optimisations. The decision to enable it initially just at the TIR->LLVM boundary came from the realisation that we can generate perfectly valid SVE from just looking at the TIR, without having to modify it.
   
   I have spent some time playing around with the current LLVM codegen, and I think you make a very good point about robustness. I have been looking at simple vectorized loads and stores (simple meaning here that the stride is 1 and that the index expression is a Ramp node, not a complex non-linear calculation with a Ramp as a leaf node). The main challenge I currently see is that while the index itself is 1D at the point of code generation, the loop nest necessarily isn't, so I have to figure out the right loop bound that needs changing from the base of the Ramp node. It seems to me that we have to do some sort of analysis pass just before the codegen to collect that info. It would have been nice to directly generate the SVE LLVM "as we go" during the LLVM codegen, but it seems that we generate LLVM with the loop bounds fixed before we visit the loop body (so before we discover the Ramp nodes), and we can't change the bound afterwards. I think doing an analysis pass would help with robustness, since we can gather as much information from the TIR graph as we need (a rough sketch of the kind of analysis I mean is below).
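   
   For illustration only (my assumption of what such a pre-codegen analysis could look like, built on the stock `tvm.tir.stmt_functor` visitor rather than anything implemented yet), the analysis could record which loop variables feed the base of each Ramp, so the backend knows which loop extents would need rewriting:
   
   ```python
   import tvm
   from tvm import tir


   def loops_feeding_ramps(func: tir.PrimFunc):
       """For every Ramp in the function, list the loop variables its base depends on."""
       loop_vars = []  # loop variables of every For node, collected up front
       report = []     # (Ramp, [loop vars used in its base]) pairs

       def collect_loops(node):
           if isinstance(node, tir.For):
               loop_vars.append(node.loop_var)

       tir.stmt_functor.post_order_visit(func.body, collect_loops)

       def collect_ramps(node):
           if isinstance(node, tir.Ramp):
               used = []

               def find_vars(e):
                   if isinstance(e, tir.Var) and any(e.same_as(v) for v in loop_vars):
                       used.append(e)

               tir.stmt_functor.post_order_visit(node.base, find_vars)
               report.append((node, used))

       tir.stmt_functor.post_order_visit(func.body, collect_ramps)
       return report
   ```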
   
   I haven't worked a lot with LLVM backends, so I'm interested in hearing any thoughts/suggestions.




[GitHub] [tvm-rfcs] ekalda commented on pull request #94: [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

Posted by GitBox <gi...@apache.org>.
ekalda commented on PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#issuecomment-1260767491

   There is more context around where this is going in the [meta-RFC](https://discuss.tvm.apache.org/t/meta-rfc-vector-length-agnostic-vla-vectorization/13596) :) 



