Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/09/09 23:27:55 UTC

[GitHub] [tvm-rfcs] hogepodge commented on a change in pull request #18: [RFC] Adding initial SVE implementation

hogepodge commented on a change in pull request #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#discussion_r705787815



##########
File path: rfcs/0018-initial-sve-addition.md
##########
@@ -0,0 +1,221 @@
+- Feature Name: Adding Initial SVE Support to TVM 
+- Start Date: 2021-07-30
+- RFC PR: https://github.com/apache/tvm-rfcs/pull/18
+
+Authors: Meera Nakrani, Sjoerd Meijer
+
+## Introduction
+
+In this RFC we propose a TIR extension to support scalable
+vectorisation. Like fixed width vectorisation, scalable vectorisation
+extracts data parallelism from code, but the vector length is unknown at
+compile time. A scalable vector's total number of elements is a
+constant multiple of a specified number of elements. The
+[LLVM LangRef](https://llvm.org/docs/LangRef.html) refers to this constant
+multiple as vscale. It is a positive integer that is unknown at compile time,
+so the overall vector length (VL) is also unknown. The value of vscale,
+and therefore of VL, depends on the architecture that runs the program.
+More details and an overview are given in
+[this tutorial](https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf),
+which works through a daxpy kernel from slide 17 onwards. In this RFC,
+we will show an example of lowering a (scalable) vector addition
+kernel from TE all the way down to LLVM IR, further illustrating the vscale
+concept. We will also cover TIR support and how it affects the LLVM codegen.
+This is an introductory RFC to check that the design of our prototype
+implementation (https://github.com/apache/tvm/pull/8655) is sound, and we
+welcome any feedback on this proposal.
+
+Before we explain this in more detail, let's first briefly look at the current
+state and terminology with an example. Vectorisation along the x-axis of an
+addition of two one-dimensional tensors A and B of size 18, writing the result
+to C, will result in the following TIR:
+
+```
+C[ramp(0, 1, 18)] = A[ramp(0, 1, 18)] + B[ramp(0, 1, 18)]
+```
+where the Ramp TIR node has the form `Ramp(base, stride, lanes)`, showing
+that these elements are processed in (vector) lanes.
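As an illustration of the Ramp semantics, the index set that a `Ramp(base, stride, lanes)` node denotes can be sketched in plain Python (a hypothetical helper for exposition, not part of the TVM API):

```python
def ramp(base, stride, lanes):
    """Indices covered by a TIR Ramp(base, stride, lanes) node."""
    return [base + i * stride for i in range(lanes)]

# Ramp(0, 1, 18) covers all 18 contiguous elements of the example tensors.
indices = ramp(0, 1, 18)
print(indices)
```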
+
+The size of 18 has been chosen to demonstrate the challenges of vectorising
+this example. Vector architecture extensions (e.g. x86 AVX512 or Arm Neon)
+typically pack and operate on a power-of-2 number of elements: 2, 4, 8, 16,
+etc. If the elements are 4-byte integers and a vector register is 128 bits
+wide, we can pack 4 integer elements into one vector register. This is an
+example of fixed width vectorisation, because the vector registers have a
+fixed width of 128 bits. Since 18, the number of elements in the vectors A,
+B, and C, is not a multiple of 4, we need 4 vector operations processing
+4 * 4 = 16 elements, plus 2 scalar operations for the remaining two elements
+(indices 16 and 17), which we call the scalar epilogue.
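The split between vector and scalar work described above can be checked with a small sketch (illustrative plain Python, not TVM code):

```python
def fixed_width_split(n, lanes):
    """Number of full vector operations and scalar epilogue iterations
    for a fixed-width vectoriser packing `lanes` elements per vector."""
    return n // lanes, n % lanes

# 18 elements with 4 lanes per vector: 4 vector ops cover 16 elements,
# and a 2-iteration scalar epilogue handles the rest.
vec_ops, epilogue = fixed_width_split(18, 4)
print(vec_ops, epilogue)
```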
+
+## Motivation
+
+Most modern vector architectures (e.g. x86 AVX512 and the Arm
+Architecture's MVE and SVE extensions) support predicated vector
+instructions, removing the need for such a scalar epilogue and allowing
+more code to be vectorised. Lane predication enables or disables individual
+lanes of a vector operation. For our example this means just 5 vector
+operations and, importantly, no scalar epilogue: since we do not need to
+process all 5 * 4 = 20 elements, the last vector operation only needs to
+write two elements, which is achieved by enabling the first two lanes and
+disabling the last two.
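The lane mask of that final predicated operation can be sketched as follows (illustrative Python; function name and shape are assumptions for exposition):

```python
def lane_mask(start, lanes, n):
    """Predicate for a vector op over elements [start, start + lanes),
    with lanes that fall beyond the tensor length n disabled."""
    return [start + i < n for i in range(lanes)]

# Fifth (last) vector operation of the 18-element example: only the
# first two of the four lanes are enabled.
print(lane_mask(16, 4, 18))
```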
+
+In addition to predication, and related to it, some new vector
+architectures also allow scalable vectorisation. As opposed to so-called
+fixed width vectorisation (e.g. Arm Neon), the Arm architecture's SVE
+extension allows implementations to choose a vector register length between
+128 and 2048 bits. It supports a vector length agnostic programming model
+which allows code to run and scale automatically across all vector lengths
+without recompilation.
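The vector length agnostic model can be sketched as a loop whose step, `vl`, stands in for the hardware-determined vector length (vscale times the minimum element count) and is unknown when the loop is written (illustrative Python, not TVM or SVE code):

```python
def vla_add(a, b, vl):
    """Vector-length-agnostic add: the same code handles any vl, with
    a bound check playing the role of the predicate for the final,
    possibly partial, vector operation."""
    n = len(a)
    c = [0] * n
    i = 0
    while i < n:
        # min(...) disables the out-of-range lanes of the last iteration.
        step = min(vl, n - i)
        for j in range(i, i + step):
            c[j] = a[j] + b[j]
        i += vl
    return c

a = list(range(18))
b = [1] * 18
# The result is identical whatever vector length the hardware implements.
assert vla_add(a, b, 4) == vla_add(a, b, 16) == [x + 1 for x in a]
```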
+
+## Problem Statement
+
+We would like to add support for the Arm Architecture's Scalable Vector
+Extension (SVE) to TVM by introducing features for Vector Length Agnostic
+(VLA) programs and predication, i.e. the two main new SVE features. We
+therefore need a way to express scalable vectorisation in both TE and TIR.
+The question is how to achieve that. In the Tensor Expression language, our
+example of adding two tensors A and B looks like this:
+
+```
+n = 18
+A = te.placeholder((n,), name="A", dtype="int8")
+B = te.placeholder((n,), name="B", dtype="int8")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+s = te.create_schedule(C.op)
+x, = C.op.axis
+s[C].vectorize(x)
+```
+
+Vectorisation along the x-axis is requested with `vectorize(x)`, and will

Review comment:
       Within markdown, code is typically denoted with a pair of `backticks` surrounding the code. I suggest using that as the convention to enclose methods, variables, and other code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org