Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/09/30 10:33:20 UTC

[GitHub] [tvm-rfcs] NicolaLancellotti commented on a change in pull request #37: Introduce the Arm(R) Ethos(TM)-U Cascading Planner

NicolaLancellotti commented on a change in pull request #37:
URL: https://github.com/apache/tvm-rfcs/pull/37#discussion_r719273376



##########
File path: rfcs/0037-arm-ethosu-cascading-planner.md
##########
@@ -0,0 +1,151 @@
+- Feature Name: Arm® Ethos™-U Cascading Planner
+- Start Date: 2021-09-22
+- RFC PR: [apache/tvm-rfcs#0037](https://github.com/apache/tvm-rfcs/pull/37)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+This feature builds upon the support introduced into TVM to compile for Arm® Ethos™-U NPUs (see [RFC](https://github.com/apache/tvm-rfcs/pull/11)). In that work, we represent NPU operators using Tensor Expressions, allowing them to be scheduled using TVM’s scheduling language. The Planner aims to take advantage of this by deriving scheduling strategies for graphs off-loaded to the NPU that are optimal in both performance and the memory required to run the network. The Planner primarily searches for inter-operator scheduling opportunities, rather than the intra-operator opportunities more commonly targeted by AutoTVM and the Auto-Scheduler. In particular, it seeks to leverage the technique of ‘cascading’.
+
+# Motivation
+[motivation]: #motivation
+
+On deeply embedded devices, working memory in the form of SRAM is at a premium. It is typically measured in kilobytes, which severely limits the selection of networks that can be run on such devices. For Arm® Ethos™-U NPUs, this becomes a significant issue: the NPU provides sufficient compute power to run larger models, but the system may not have the memory necessary to execute them. The Cascading Planner seeks to alleviate this by rescheduling networks such that they use less working memory and can accordingly take advantage of the considerable compute capabilities of the NPU.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+The Cascading Planner will act as a scheduler for the Arm® Ethos™-U NPU compiler in TVM. This means it will make decisions about the order in which operations are executed, as well as which memories tensors are assigned to and whether they need copying between memories. Its primary objective is to reduce the working memory needed to run the network until it fits into a user-specified memory budget. The behaviour of the Planner will be controlled by a number of user-configurable options which will ultimately be exposed all the way up to TVMC (exact interface TBD). However, it is expected that for most ordinary use-cases the Planner will be invisible to the user, as it will be configured with sensible defaults.
+
+Additionally, the Planner will rely on a description of the device's memory layout to know the bandwidth and size of the different sections. This allows the Planner to effectively optimize models such that they fit within the memory constraints while still giving good performance. It is expected that this information will be derived from the 'PoolInfo' mechanism that is proposed to be used by the Unified Static Memory Planner ([RFC](https://github.com/apache/tvm-rfcs/pull/9)).
+
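+For illustration only (the field names below are hypothetical, not the final 'PoolInfo' schema), the Planner needs roughly this kind of per-memory information:
+
+```python
+# Hypothetical sketch of the per-memory description the Planner consumes.
+# Names and values are illustrative assumptions; the real schema is expected
+# to be derived from the USMP 'PoolInfo' mechanism.
+MEMORY_POOLS = {
+    "SRAM": {"size_bytes": 384 * 1024, "read_bandwidth_bytes_per_cycle": 16},
+    "FLASH": {"size_bytes": 4 * 1024 * 1024, "read_bandwidth_bytes_per_cycle": 4},
+}
+```
+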
+Schedulers for the NPU are required to expose an interface of TE Schedule -> TE Schedule. Note that a side-channelled constant dictionary is required due to the inability to express constants in TE/TIR. Should progress be made in resolving that restriction, this side-channel would be dropped from the design. Refer to the following diagram to see where the Planner fits into the compilation flow for Ethos-U:
+
+![planner-flow](../resources/planner-flow.png)
+
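+As a shape-level sketch (the function name and exact signature below are hypothetical, not a confirmed API), such a scheduler looks roughly like:
+
+```python
+from typing import Dict, Tuple
+
+import numpy as np
+from tvm import te
+
+def plan(
+    sch: te.Schedule,
+    const_dict: Dict[int, np.ndarray],  # side-channelled constant dictionary
+) -> Tuple[te.Schedule, Dict[int, np.ndarray]]:
+    """Hypothetical sketch: take a TE schedule plus its side-channelled
+    constants and return the rescheduled schedule and updated constants."""
+    ...
+```
+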
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Cascading
+
+Cascading is the term we use for a form of inter-operator scheduling in which a series of operations are executed together as a sequence of dependent N-dimensional tiles. We refer to these N-dimensional tiles as 'stripes'; each stripe corresponds to calculating only part of a tensor at a time. By employing this technique, we can reduce the amount of memory required to run the network. The following diagram demonstrates a simple example of this for a small network containing only two NHWC convolutions. It shows both the 'standard' sequential scheduling, where operators are executed one after the other, and a cascaded schedule using an output stripe size of (1, 3, 3, 4). Note that the diagram omits the weight and bias tensors.
+
+![cascading-diagram](../resources/cascading-diagram.png)
+
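+As a rough illustration of the saving (a minimal sketch, not part of the Planner; the intermediate tensor shape here is an assumption for this example), we can compare the peak intermediate-tensor memory of the two schedules:
+
+```python
+import math
+
+def tensor_bytes(shape, bytes_per_element=1):
+    """Size of a tensor in bytes (assuming 8-bit elements)."""
+    return math.prod(shape) * bytes_per_element
+
+# Two back-to-back NHWC convolutions: input -> conv1 -> mid -> conv2 -> output.
+mid_shape = (1, 6, 6, 4)     # full intermediate tensor (assumed shape)
+stripe_shape = (1, 3, 3, 4)  # output stripe size from the diagram
+
+# Sequential scheduling materializes the whole intermediate tensor at once.
+sequential_peak = tensor_bytes(mid_shape)    # 144 bytes
+
+# Cascading only ever holds one stripe of the intermediate tensor at a time
+# (ignoring the halo the second convolution's kernel needs around each stripe).
+cascaded_peak = tensor_bytes(stripe_shape)   # 36 bytes
+
+print(sequential_peak, cascaded_peak)
+```
+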
+Depending on the choice of stripe sizes, cascading groups and the network architecture, the memory saved can be substantial. For example, MobileNetV1 run naively requires in excess of 1.2MB of working memory, but intelligent application of the cascading technique can reduce this to around 300kB, making the network much more accessible to deeply embedded devices. In addition, where a hierarchical memory system is present with faster and slower regions, reducing the memory requirement can improve performance by allowing the intermediate results to fit entirely within the fast memories or caches.
+
+## Affine Transforms
+
+Deciding exactly which operators should be cascaded and with what striping parameters is a non-trivial problem: most local solutions are not globally optimal and the search space is very large. We therefore introduce a technique using affine transforms that allows the memory usage and performance of a given cascading option to be calculated quickly, without having to perform a full scheduling using TE/TIR.
+
+They key piece of information to calculate in order to characterize a cascade is how the stripe size changes throughout. This is a function of the data dependency between an operator's inputs and outputs. For many operators that we're interested in, an affine transform matrix can be used to represent this dependency if we represent the input and output stripe sizes as vectors. Affine transforms typically consider 'augmented' matrices and vectors (https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which allow for the representation of constant changes. Concretely, we define the transform matrix M as being the matrix for which the following holds:

Review comment:
       ```suggestion
   The key piece of information to calculate in order to characterize a cascade is how the stripe size changes throughout. This is a function of the data dependency between an operator's inputs and outputs. For many operators that we're interested in, an affine transform matrix can be used to represent this dependency if we represent the input and output stripe sizes as vectors. Affine transforms typically consider 'augmented' matrices and vectors (https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which allow for the representation of constant changes. Concretely, we define the transform matrix M as being the matrix for which the following holds:
   ```

##########
File path: rfcs/0037-arm-ethosu-cascading-planner.md
##########
@@ -0,0 +1,151 @@
+They key piece of information to calculate in order to characterize a cascade is how the stripe size changes throughout. This is a function of the data dependency between an operator's inputs and outputs. For many operators that we're interested in, an affine transform matrix can be used to represent this dependency if we represent the input and output stripe sizes as vectors. Affine transforms typically consider 'augmented' matrices and vectors (https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which allow for the representation of constant changes. Concretely, we define the transform matrix M as being the matrix for which the following holds:
+
+![cascading-formula-1](../resources/cascading-formula-1.png)
+
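+To make this concrete, the following minimal numpy sketch (illustrative only) applies such a matrix to an augmented output-stripe vector to recover the required input stripe, using the matrix derived for the 3x3 convolution example below:
+
+```python
+import numpy as np
+
+# Transform matrix for an unstrided, undilated, unpadded 3x3 NHWC convolution
+# with 8 input channels (derived in the following paragraphs). The rows
+# correspond to the input stripe's (N, H, W, C) axes plus the augmentation row.
+M = np.array([
+    [1, 0, 0, 0, 0],  # batch: identity
+    [0, 1, 0, 0, 2],  # height: output height + (kernel_h - 1)
+    [0, 0, 1, 0, 2],  # width:  output width  + (kernel_w - 1)
+    [0, 0, 0, 0, 8],  # channels: all 8 input channels, regardless of output
+    [0, 0, 0, 0, 1],  # augmentation row
+])
+
+output_stripe = np.array([1, 3, 3, 4, 1])  # a (1, 3, 3, 4) stripe, augmented
+input_stripe = M @ output_stripe
+print(input_stripe)  # [1 5 5 8 1], i.e. a (1, 5, 5, 8) input stripe is needed
+```
+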
+Let's briefly consider how to derive such a transform matrix for a 3x3 unstrided, undilated and unpadded NHWC convolution. Immediately, the '3x3' kernel tells us something important: a single element in the output depends on a 3x3 region in the height/width of the input. If we were instead to consider a 2x2 region of the output in the height/width dimensions, we'd need a 4x4 region of the input. So in general, we need 2 more elements in height and width when calculating the dependencies of an output stripe; more generally, this number is kernel_size - 1 in each axis. Now consider the channels: in a convolution, no matter how many output elements you are computing, you always need every input channel. This is because the input channel axis is a reduction axis in a convolution; in a sense, it isn't 'reflected' in the output. Combining these two observations, we arrive at the following transform matrix:
+
+![cascading-formula-2](../resources/cascading-formula-2.png)
+
+The first row is simply an identity for the batch axis. The second and third rows are more interesting, corresponding the the data dependencies in the height and width axes. Here we see that the matrix is taking the output size and adding a fixed value of 2, as we described above. The fourth row is the channels axis, and as we expect this is constant and always 8 - the number of channels of the input tensor. The final row is simply the augmentation row and can be ignored.

Review comment:
       ```suggestion
   The first row is simply an identity for the batch axis. The second and third rows are more interesting, corresponding to the data dependencies in the height and width axes. Here we see that the matrix is taking the output size and adding a fixed value of 2, as we described above. The fourth row is the channels axis, and as we expect this is constant and always 8 - the number of channels of the input tensor. The final row is simply the augmentation row and can be ignored.
   ```



