Posted to dev@tvm.apache.org by Eric Lunderberg <no...@github.com.INVALID> on 2022/06/06 20:38:14 UTC

[apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

This RFC introduces a method to specify padding to be applied as part of a buffer layout transformation, to be used when the desired layout does not evenly tile the buffer being transformed, and simplifications that can be performed based on these padded buffers.

The motivating examples are primarily in the "Implementation options" section, which goes through several desired usages of the buffer padding, and how they can be automatically derived using the TIR primitives/transformations described in earlier sections.

TODO: Rendered Markdown link
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm-rfcs/pull/77

-- Commit Summary --

  * [RFC] Buffer Layout Padding

-- File Changes --

    A rfcs/XXXX-layout-transform-padding.md (2521)

-- Patch Links --

https://github.com/apache/tvm-rfcs/pull/77.patch
https://github.com/apache/tvm-rfcs/pull/77.diff

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77
You are receiving this because you are subscribed to this thread.

Message ID: <apache/tvm-rfcs/pull/77@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
Thanks @Lunderberg for the update, I think we are moving in a positive direction on the overall IR design. Some additional feedback:

## Keep Schedule Decisions Local to PrimFunc then Compose

On schedule primitives, to be pragmatic, it would be helpful to have some of the cross-PrimFunc re-flowing done in two steps. Specifically, some of your `transform_layout` examples touch buffers that involve the function's inputs. One approach is of course to trace up to the producers and rewrite the producer functions as well (or trace down to the consumer functions). However, the complications here are that:

- There can be multiple consumer/producer TIR functions.
- In certain cases producers/consumers may not have consistent requirements.
- The producers/consumers themselves can have their own local layout preferences that need to be consolidated.

In general it is helpful to first keep the schedule decision local, e.g. introducing a caching stage (AC, BC in the example below), then compose with another reflowing pass to bring the decision to consumers/producers. This is mainly to reduce the overall complexity of implementing such transformations, and it also makes things more modular.

```python
@T.prim_func
def grow(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for i, j in T.grid(4, 4):
        BC[i, j] = 2 * AC[i, j]

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", 0])
            B[i] = BC[i // 4, i % 4]

@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        B[i] = A[i] + 1

@R.func
def main(A: T.Tensor[14, "int32"]):
    lv0 = call_tir(grow, [A], (14,))
    # an intermediate stage to show non-local reflowing
    lv1 = call_tir(addone, [lv0], (14,))
    lv2 = call_tir(grow, [lv1], (14,))
    ...
```

## Use IfThenElse expression for Padding.

While it is possible to express padding with one loop that computes the values and another loop that writes the padded value, it is harder to schedule the resulting blocks, as there is more than one producer. Having a single loop and using `T.if_then_else` expresses the pattern in a single shot and makes future rewriting easier.


```python
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)
```

## Propagate Padding Decisions from the End.
 
Some of the complications of duplicated conditions (and their simplification) are rooted in the fact that we do the layout transform of output and input separately (each introducing its own conditions, which then need to be simplified). It might be helpful to do a global transformation, usually driven from the output, then "backprop" the implications of that decision to the input. Doing such a transformation in a single shot will likely alleviate the need to generate extra conditions and then simplify them.
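
As an illustrative sketch, reusing the running 14-element example (a sketch of the intent, not a committed design), a single output-driven rewrite could land directly on:

```python
@T.prim_func
def func(A: T.Buffer[(4, 4), "int32"], B: T.Buffer[(4, 4), "int32"]):
    # The [14] -> [4, 4] mapping is decided once at the output B, then
    # "backpropped" to A in the same rewrite, so only a single padding
    # condition is ever introduced.
    for io, ii in T.grid(4, 4):
        B[io, ii] = T.if_then_else(4 * io + ii < 14, 2 * A[io, ii], 0)
    # If A's padding is additionally known to be 0, even this one condition
    # can be dropped, since 2 * 0 == 0 matches B's assumed pad value.
```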

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1170294348
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> Our design principle at the TIR level is that ideally we start with one instance of a possibility, then use the probabilistic space of meta-schedule to represent multiple choices.

For this, would the layout re-flowing occur periodically during optimization?  Otherwise, including transformations in the performance benchmarking of candidates would unfairly penalize candidates that add a transformation step, while excluding transformations would unfairly bias toward transformations, even when sequential operators require separate layout transformations.

Representing different options as different allowed steps in a search space makes sense to me, so long as the candidates are reasonably exposed to the optimizer.

> In our particular example, however, the idea is that the schedule primitive does not modify the input/output buffer, but introduces preproc and postproc stages with a clear hint that they should be lifted out (aka we are doing the same thing in two steps)

I think I understand.  That would effectively be treating the preproc/postproc stages as separate function bodies, but ones which happen to exist within the same TIR PrimFunc for ease of use.

With this representation, I think the biggest part would be determining when to fix a previously free parameter, in order to expose it as an assumption to another TIR PrimFunc.  Maybe in the "Step 2: Reflowing of Layouts", this isn't used to cancel any statements out, but instead to create a dynamic performance penalty if an assertion is no longer held, with the performance penalty equal to the time required to do the transformation.

> As a quick intermediate middle ground: for most intermediate stages (really like add or exp), we would ideally not insert any layout decisions, and allow decisions from other key ops (conv and matmul) to backprop their decisions to them.

I'd agree, though I would phrase it somewhat differently.  The element-wise operations impose a constraint on the input and output layouts: the input and output must have identical layouts.
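
As a minimal sketch of that constraint (using a hypothetical `LayoutMap` alias; none of these names are existing TVM APIs), propagating a layout through an element-wise op is just the identity:

```python
from typing import Callable, Tuple

# Hypothetical: a layout mapping from a flat index to transformed indices.
LayoutMap = Callable[[int], Tuple[int, int]]

def propagate_elementwise(output_layout: LayoutMap) -> LayoutMap:
    # For B[i] = f(A[i]), both buffers are indexed by the same variable,
    # so any layout chosen for B can be imposed on A unchanged.
    return output_layout

# The padded [14] -> [4, 4] mapping used throughout this thread.
layout = lambda i: (i // 4, i % 4)
assert propagate_elementwise(layout)(13) == (3, 1)
```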

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1165889831
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
Writing out some of my thoughts, to see if there's a way to express the constraints while only using existing TIR features.  The main goals would be as follows.

1. Allow simplification of expressions based on the values present in the padding.
2. Allow local simplifications to take advantage of non-local constraints, without requiring a full end-to-end analysis.
3. Specify the non-local constraints in some deducible manner that doesn't impose a runtime performance penalty.
   
Next, working through various options for how the constraints could be stored. In the examples below, I sketch out how each would apply to the element-wise operation that starts as below.

```python
@T.prim_func
def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        B[i] = 2 * A[i]
```

1. Apply layout transforms on local caches.  Here, the full lifetime of a buffer is known.  All TIR optimizations are done prior to hoisting the cache and layout transformation into the graph level.
   
   - For read caches, the pad value is whatever gets conditionally written to the padding while generating it.  In the example below, `AC` could be recognized as being padded.
     
     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         AC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 AC[io, ii] = A[4 * io + ii]
             else:
                 AC[io, ii] = 0
     
         for i in T.serial(14):
             B[i] = 2 * AC[i // 4, i % 4]
     ```
     
   - For write caches, the pad value is whatever is in the padding after the last write to the cache.  In the example below, `BC` could be recognized as being padded.

     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         BC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 BC[io, ii] = 2 * A[4*io + ii]
             else:
                 BC[io, ii] = 0
     
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = BC[io, ii]
     ```

   - Downside, either of the `else` statements could be eliminated as a no-op, since they don't contribute to the output `B` value. After that elimination, there wouldn't be any way to reconstruct the pad value.
     
2. When hoisting an allocation+transformation, write the pad value to the buffer at the start of the function from which it was hoisted. This way, the pad value can still be used in local reasoning.
   
   - No change needed in producers, since they would already write the pad value to the buffer.
   
   - For consumers, would be represented as writing `pad_value` into the padding at the start of the function.
   
     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0
     
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
     ```
     
   - Downside, repeated unnecessary effort at the beginning of each consumer.  Avoiding it with this representation would require knowing that the producer had written `pad_value` already, which is exactly the information we're trying to avoid.
     
3. When hoisting an allocation+transformation, write the pad value to the buffer at the start of the function from which it was hoisted, and write `T.undef()` at the end.  This way, the pad value can still be used in local reasoning, and no-op removal can remove the repeated writing when lowering.
   
   - No change needed in producers, since they would already write the pad value to the buffer.
     
   - For consumers, would be like option 2, but with an additional write of `T.undef()` at the end of the function.  When lowering, the write of `T.undef()` would allow the first write to be removed as a no-op because it is overwritten.  The `T.undef()` can then be removed as described in the RFC.
   
     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0
     
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
     
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = T.undef()
     ```
     
   - Downside, no way to distinguish between "can assume the pad value is zero" and "can overwrite the pad value at will".  The writing of `T.undef()` would allow any writes to the padding to be inserted as a no-op.
     
   - Downside, wouldn't actually simplify out in cases where the pad value is used.  The first in a pair of repeated writes to the same location can only be removed if there are no reads between the writes.  After using the pad value to eliminate `if 4 * io + ii < 14` from the compute, the dummy loop that writes the padding could no longer be removed.
     
4. Use `AssertStmt` in a loop to declare known information about the buffers.

   - No change needed in producers, since the pad value is already written out.
     
   - For consumers, would have an initial loop that asserts the pad value is correct.

     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 assert AC[io, ii] == 0, "padding"
     
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
     ```
     
   - Downside, assert statements have target-dependent handling.  In `CodeGenLLVM` and `CodeGenSPIRV`, they are treated as no-ops.  In `CodeGenCPU` and `CodeGenC`, they generate asserts.  In `CodeGenCUDA`, they aren't handled at all and would error out.
     
     Could work around this with a lowering pass, but identifying these conditions would require having a special string in the message, and packing structured data into strings makes me wary.
     
5. Use `AssertStmt` with implicitly-defined variables to declare known information about the buffers.
   
   ```python
   @T.prim_func
   def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
       a = T.var("int32")
       b = T.var("int32")
       assert (
           AC[a, b] == 0 or (4 * a + b < 14) or (a < 0) or (a >= 4) or (b < 0) or (b >= 4)
       ), "padding"
   
       for io, ii in T.grid(4, 4):
           if 4 * io + ii < 14:
               B[4 * io + ii] = 2 * AC[io, ii]
   ```
   
   - Can apply to clamped texture memory, since the variables in the assertion aren't restricted to the buffer's bounds.
     
   - Would need to recognize specific pattern of `BufferLoad` being used to define variables used in constraint.
     
   - The implicitly-defined variables can be written in current TIR, but we would need to ensure that they never make it into generated code at runtime.
   
   - Downside, implicitly-defined variables are something of a red flag.

6. Store constraints in the function attributes, either as a dictionary or as a structured object.
   
   ```python
   @T.prim_func
   def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
       T.func_attr(
           "buffer_constraints",
           [
               {
                   "buffer": AC,
                   "predicate": lambda io, ii: 4 * io + ii < 14,
                   "pad_value": lambda io, ii: 0,
               },
           ],
       )
   
       for io, ii in T.grid(4, 4):
           if 4 * io + ii < 14:
               B[4 * io + ii] = 2 * AC[io, ii]
   ```
   
   - Downside, requires transformations that change a buffer to be aware that other structures will also need to be replaced.
     
   - Downside, requires simplifications to either be passed the entire `PrimFunc`, or to be explicitly passed the `"buffer_constraints"` list.
     
   - Downside, would break expectations of `IRMutatorWithAnalyzer`. The current entry point of any `Stmt` or `Expr` would need to have additional information of the `"buffer_constraints"`.
     

7. Store constraints in the `Buffer` object, either as a dictionary or as a structured object.
   
   ```python
   @T.prim_func
   def func(ac: T.handle, B: T.Buffer[14, "int32"]):
       AC = T.match_buffer(
           ac,
           shape=(4, 4),
           dtype="int32",
           constraints=[T.BufferConstraints(predicate=lambda io, ii: 4 * io + ii < 14, pad_value=0)],
       )
   
       for io, ii in T.grid(4, 4):
           if 4 * io + ii < 14:
               B[4 * io + ii] = 2 * AC[io, ii]
   ```
   
   - Downside, introduces additional data structure in TIR.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163620046
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
Thanks everyone for the discussions. We have agreed on the design principles and will continue to explore scheduling options. Let's keep the RFC open for final comments until the end of this week.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1182544837
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
Following up on this, I think we are in broad-stroke agreement that we can achieve our goals with block/fn attributes in the IR as well as builtin assume. As a result, my original blocker for the RFC has been resolved. It would still be great to work together to flesh out the details of schedule primitives and how they interact with the rest of TIR scheduling, but I somewhat think that can be done separately and we don't need to nail down the details of the primitives now.

The schedule primitives can be done relatively independently as long as we agree on the principle that:
- Transformations do not change the function interface behavior
- We decouple the graph level decisions into two steps: local decision + rewrite

We can explore possible options as long as the IR spec remains stable. If there is a need to update the IR itself or the meaning of an attribute, we can come back and discuss again.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1180606480
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> For example, we may introduce an explicit cache stage to add the padding, and mark this block for later processing.

Wouldn't that require a "remove entirely" annotation that was suggested against [here](https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163019805)?  I could see how we could mark a transformation to be hoisted out later, but when some simplifications require the constraint to be expressed in the producer, and others in the consumer, exposing it to both `PrimFuncs` for local simplifications would require either duplication of the block, or maintaining non-local information only for a single pass.  If the stage is duplicated, all but one of the duplicates would need to be marked as temporary.  If the information is only retained for a single pass, then any scheduling/optimization of a single subgraph would require walking through the entire end-to-end model.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163616169
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
>  In general, a PrimFunc's interface could only be changed when calls into the PrimFunc are also modified to remain compatible.

Agreed, that is what I originally intended to say 

> Is there a better term than "scheduling primitive" to describe layout transformations that impact input/output buffers? I think the difference is between context-independent transformations that may be performed on a PrimFunc without changing, as opposed to context-dependent transformations that may only be performed as part of a graph-level transformation.

There are a few things here. One approach would be to allow a schedule primitive to modify multiple functions (including callers); we might need this for more complicated cases.

In our particular example, however, the idea is that the schedule primitive does not modify the input/output buffer, but introduces preproc and postproc stages with a clear hint that they should be lifted out (aka we are doing the same thing in two steps).

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1165724403
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
Merged #77 into main.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#event-7004931146
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Chris Sullivan <no...@github.com.INVALID>.
Thanks for sharing the contextual pointers for the community @vinx13. Agreed that the approaches discussed are both valid. I would actually like to argue the stronger point that they are complementary, and only appear to be contrary because we are considering too narrow a scope.

It can be helpful to share an overview of common handling of layout transformations in ML compilers. Most of my arguments against A0 (local cancellation in the graph only) being sufficient stem from prior experience with graph layout optimizations. The evolution of the graph approaches for optimizing layouts that I've seen followed the trajectory below:
 
1) The first graph approach that is often taken is what has been argued so far in A0: local back-to-back cancellation. It works reasonably well when the data flow and op variety in a model are simple.

Local-only cancellation tends to fail in models which still have simple data flow, but more variety in the sequence of operators, each with different valid implementations. Consider:

```
-> transformX -> conv2d -> (inv_transform_X) -> pool -> (transformX) -> conv2d -> inv_transform_X
```
In this case `pool` can be replaced by any sequence of operations that are layout agnostic, or for which there exist multiple implementations, so the choice of layout is unconstrained. These operators are layout unconstrained, whereas the convolutions are layout constrained. As you can see, even for a simple model the approach discussed in A0 already needs to be modified to support non-local layout analysis and folding.

2) The typical second approach is then to still utilize A0, but to first apply a pre-processing pass that sinks layout transforming operations in the graph along the path of data flow [[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), [2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)]. The above case then becomes, 

```
-> transformX -> conv2d -> pool ->  (inv_transform_X -> transformX) -> conv2d -> inv_transform_X
```
Then apply the method discussed in A0 and do local cancellation. 

The above method works well for models with relatively simple data flow but for models with more branching the method has limitations. A simple consideration is sinking a transform through an operation with multiple inputs. The process of doing so requires materialization of the inverse transform on the other operands. 

For the sake of simplicity consider matrix multiplication: ${A^{T}}B = {(B^{T}A)}^T$. In this case the final state of sinking the transpose on A was to materialize two transposes rather than one: one on B and one on the matmul result. Sinking alone isn't sufficient to guarantee a globally optimal layout because it still only treats the propagation of transforms locally/greedily.
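
For concreteness, a quick numerical check of the identity:

```python
import numpy as np

# A^T @ B == (B^T @ A)^T: sinking the transpose on A through the matmul
# materializes two transposes (one on B, one on the result) instead of one.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))
assert np.allclose(A.T @ B, (B.T @ A).T)
```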

3) A modification to sinking (downward along data flow) is to introduce upward flowing [[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)]. It can help by flowing transforms along the poisoned operands (e.g. B in the above matrix multiply) by propagating the transform up as far as possible, hopefully to a constant where it can be folded. 

For inference graphs I've seen this approach work well. But the approach is still greedy, and suboptimal choices can occur. For training graphs this approach works less well, due to the data-flow complexity involved with branching from the forward to the backward graph and the optimizer's in-place update of weights. I omit a specific example in this case for brevity, but encourage a review of the graphs from @t-vi's application of TVM to PyTorch training for BERT, and the long chains of transposes and reshapes that occur within the forward and backward multi-head attention layers [[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].

4) Finally, to arrive at a closer to globally optimal solution for layout, different constraint-based approaches are considered. Constraints from operations which are layout constrained can be flowed across unconstrained parts of the graph until an approximate global optimum is reached. 

An example implementation I have seen included layout sources (e.g. operators like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. operations which involve data movement by DMA engines or in-memory compute which allow zero-cost data layout rearrangement during store). A constraint solver in this case flows layout constraints from sources toward sinks that can absorb aggregated/merged layout transform constraints. 
____

Coming back to the present discussion, I believe our design should be focused on ensuring that one or more of the non-local approaches discussed above in 2-4 are achievable. Any of these cases require the following components:

C0) The ability to track constraints on a buffer.

C1) The ability to roundtrip between an IR representation and the producer/consumer constraint representations.

C2) The ability to merge/fold constraints - flowing is just merging a constraint with an unconstrained buffer.

Even for the purely local (back-to-back) case discussed in A0, components C1 and C2 are helpful, with the caveat that the constraints inferred from the IR only exist within the local context of a single producer-consumer pair in a pass.
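
To make C0-C2 concrete, here is a minimal Python sketch; the `BufferConstraint` record and `merge` rule are illustrative stand-ins, not existing TVM structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BufferConstraint:            # C0: track constraints on a buffer
    buffer_name: str
    layout: Optional[str] = None   # e.g. an index-map string
    pad_value: Optional[int] = None

def merge(a: Optional[BufferConstraint],
          b: Optional[BufferConstraint]) -> Optional[BufferConstraint]:
    # C2: flowing is just merging a constraint with an unconstrained buffer;
    # a genuine conflict forces materializing an explicit transform stage.
    if a is None:
        return b
    if b is None or a == b:
        return a
    return None  # conflict: caller must materialize a layout transform

# C1 (roundtrip with the IR) would infer such records from preproc/postproc
# stages like the "pad"/"crop" blocks discussed elsewhere in this thread.
```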

Thus both A0 and A1 can benefit from these components, and the delta that exists between A0 and A1 is clearer:

* Delta 1: In A1 buffer constraints are maintained per buffer in the graph globally (non-local), and therefore can be optimized by any of the methods 2-4 discussed.

* Delta 2: In addition to inferring buffer constraints from the IR (one half of C1), A1 proposes that constraints expressed about the memory during scheduling be maintained for some time.


-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1153227651
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
Thank you very much for the comments, suggestions, and discussion, and I'm quite happy with how the design evolved over the course of the discussions!

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1182157349
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
Adding some additional discussion with @csullivan .

We agree that:
- There are different ways to encode layout and padding decisions:
    - E0: BufferConstraint (as an element in the IR)
    - E1: Composing a stage that transforms the layout (a loop that represents the mapping)
- Non-local rewrites are needed to be able to propagate the layout and padding decisions throughout the entire model through constraint solving.

Right now we have some healthy discussions about ways to encode layout and padding decisions. 

Some of my thoughts:

Introducing changes to TIR needs some additional thought and deserves extra consideration, due to the N*M complexity (where N is the number of TIR possibilities and M is the number of primitives to be supported).

Right now it is possible to do non-local constraint-rewriting flows as part of a graph pass. Note that while E1 is indeed less "compact" on one hand, we can use it to reconstruct the compact data structure (that represents the layout decision) that we can use to flow the decisions across the graph nodes.  E1 also enables some additional capabilities, e.g. expressing future memory remappings that do not necessarily fit into padding/packing.

Starting from the graph level allows us to capture learnings, then use some e2e goals to make an informed decision on TIR level change later if needed.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1156407362
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
Thanks for the discussion. To provide more context, the A0 approach we discussed is TIR-Relax layout rewriting https://github.com/tlc-pack/relax/issues/162 (the general idea is to lift such transformations in TIR scheduling into the graph, and then cancel out redundant intermediate transformations, either by proving that fusing the pair of post-compute and pre-compute transformations produces an identity TIR function, or by using high-level operator semantics). I think this is very similar to the [graph-level solution](https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/4) mentioned by @wrongtest.
In general, both A0 and A1 are valid approaches. It is mainly about how we would like to handle the complexity in simplifications.
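
As a toy illustration of that cancellation proof, with pure-Python stand-ins for the pair of TIR transformations (shapes follow the running [14] -> [4, 4] example):

```python
def pad(a):    # [14] -> [4, 4] with pad value 0 (pre-compute transform)
    return [[a[4 * io + ii] if 4 * io + ii < 14 else 0 for ii in range(4)]
            for io in range(4)]

def crop(ac):  # [4, 4] -> [14] (post-compute transform)
    return [ac[i // 4][i % 4] for i in range(14)]

x = list(range(14))
assert crop(pad(x)) == x  # fused post-compute + pre-compute is an identity
```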

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1152992143
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Andrew Reusch <no...@github.com.INVALID>.
this is on the agenda for tomorrow's [community meeting](https://discuss.tvm.apache.org/t/next-tvm-community-meeting-june-8-2022/12900). Perhaps we could discuss in higher bandwidth there?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1149027473
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
Added some examples to build on top of @Lunderberg's example.

## Transformation

The main differences between annotation and special handling are:

- An annotation is not necessary for the correctness of the program, but it may provide hints toward future optimizations.
- Without the annotation, the program still runs correctly, but certain optimizations may not trigger.

### Step 0: Produce temp stages with annotation

The transformation produces temporary buffers (AC and BC), where the relation between that data and A, B is recorded in two blocks (preproc and postproc).

Note that these additional annotations are hints for the compiler to perform future optimizations (e.g. to lift them out or cancel them). Our eventual goal could be to reason about those properties directly from the code, but annotations provide a first shortcut.

```python
@T.prim_func
def grow(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for i, j in T.grid(4, 4):
        BC[i, j] = 2 * AC[i, j]

    for i in T.serial(14):
        with T.block():
            # hint that this is a cropping operation,
            # where we know that the remaining part in B is 0.
            # Additionally, the remaining uncovered values
            # are assumed to be 0; if not provided, no assumptions are made
            T.block_attr("postproc", ["crop", 0])
            B[i] = BC[i // 4, i % 4]

@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        B[i] = A[i] + 1

@R.func
def main(A: T.Tensor[14, "int32"]):
    lv0 = call_tir(grow, [A], (14,))
    # an intermediate stage to show non-local reflowing
    lv1 = call_tir(addone, [lv0], (14,))
    lv2 = call_tir(grow, [lv1], (14,))
    ...
```

Note that the special crop annotation comes with an `assumed_value`, which is provided as part of the transformation (and we can actually prove that it is safe if our layout transformation starts from B and goes backwards).

### Step 1: Reconstruct constraint at TIR-Graph level

By looking at the PrimFunc, we know that there is a desire to split out the preproc stage and postproc stage to the graph. It is totally fine for the compiler to choose not to do so, and it is still a valid program. But let us say we choose to lift them out:

```python
@T.prim_func
def grow_packed(AC: T.Buffer[(4, 4), "int32"], BC: T.Buffer[(4, 4), "int32"]):
    for i, j in T.grid(4, 4):
        BC[i, j] = 2 * AC[i, j]

@T.prim_func
def pad(A: T.Buffer[14, "int32"], AC: T.Buffer[(4, 4), "int32"]):
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

@T.prim_func
def crop_with_pad_assume(BC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
    # note that this crop carries a pad assumption (about the other values of BC)
    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", 0])
            B[i] = BC[i // 4, i % 4]

@R.func
def main(A: T.Tensor[14, "int32"]):
    lv0 = call_tir(pad, [A], (4, 4))
    lv1 = call_tir(grow_packed, [lv0], (4, 4))
    # these are the two stages that we want to use for global format reflowing
    lv2 = call_tir(crop_with_pad_assume, [lv1], (14,))
    lv3 = call_tir(addone, [lv2], (14,))
    lv4 = call_tir(pad, [lv3], (4, 4))
    lv5 = call_tir(grow_packed, [lv4], (4, 4))
    lv6 = call_tir(crop_with_pad_assume, [lv5], (14,))
```

### Step 2: Global Reflowing of layouts

Now as a last step, let us say we will do global reflowing.

- Start from reverse topological DAG order.
- Whenever we encounter a pad, we reconstruct an in-memory data structure (something like BufferConstraint, e.g. `PadMapping(constraint, pad_value=0)`).
- We try to "backprop" the PadMapping throughout the graph.
- Each function needs its own TIR analysis of how it flows things back. For example, in the case of `addone`, we can safely flow PadMapping back, changing `addone` to `addone_packed` by analyzing the TIR. If `addone` were an element-wise exp, however, we would need to insert a select operator (because `exp(0)=1`), and the message to the input becomes `PadMapping(constraint, pad_value=undef)`.
- When `PadMapping` meets `crop_with_pad_assume`, we can attempt to simplify and cancel them out.
- When there are branches, transpositions at the graph level, or other more complicated issues, we might choose to materialize. (A rough sketch of this reflow loop follows below.)
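
A rough sketch of this reflow loop, with `PadMapping` and the per-op rules as hypothetical stand-ins (not TVM APIs):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PadMapping:
    constraint: str           # index mapping, e.g. "i -> (i // 4, i % 4)"
    pad_value: Optional[int]  # None stands for undef

# Message created when we encounter the pad at lv4, then flowed backwards
# through its producers in reverse topological order.
msg = PadMapping("i -> (i // 4, i % 4)", pad_value=0)
for op in ["addone", "crop_with_pad_assume"]:
    if op == "addone":
        # addone can absorb the packed layout (addone -> addone_packed),
        # but since addone(0) != 0 the pad value becomes undef; an
        # element-wise exp would additionally require inserting a select.
        msg = PadMapping(msg.constraint, pad_value=None)
    elif op == "crop_with_pad_assume":
        # PadMapping meets the crop: attempt to simplify and cancel the pair.
        print("cancel crop/pad pair under", msg)
```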

### Discussion

There are a few key properties that are really desirable here:

- Transformations of a PrimFunc do not change the PrimFunc interface: this is really important so that we can transform a PrimFunc without worrying about how the graph interacts with it (as the interface remains the same, we can lift out the blocks earlier).
- There are implicit assumptions generated (`crop_with_pad_assume`) to enable some simplifications (otherwise a select is necessary, which is also not as bad). Note that the assumptions are generated under a global context (when we do transform padding we actually know that the overflowing fields are 0). But an extra amount of care is needed when we attempt to move `crop_with_pad_assume`, as it really depends on the value property of its input. The high-level gist is that we should not do that; instead, the global reflowing of layouts will reflow the `PadMapping` to `crop_with_pad_assume` and then cancel it out.

Talking about "constraints", it is also useful to talk about categories of them; roughly we can divide them into three categories:

- static_assert: We want to assert some invariant of the code, and it is also necessary to prove that it holds at compile time; otherwise a compilation error needs to be raised.
- (runtime) assert: We want to assert some invariant of the code; it is not necessary to prove that it holds, but we need to do runtime checking if it cannot be proved.
- assume (from __builtin_assume): We want to assert some invariant of the code; it is not necessary to prove (or check) that it holds during compilation.

All three types of constraints can be helpful. In our particular case, an `assume` is being generated in `crop_with_pad_assume`.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1164440693
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
> > For example, we may introduce explicit cache stage to add the padding, and mark this block for later processing.
> 
> Wouldn't that require a "remove entirely" annotation that was suggested against [here](https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163019805)? I could see how we could mark a transformation to be hoisted out later, but when some simplifications require the constraint to be expressed in the producer, and others in the consumer, exposing it to both `PrimFuncs` for local simplifications would require either duplication of the block, or maintaining non-local information only for a single pass. If the stage is duplicated, all but one of the duplicates would need to be marked as temporary. If the information is only retained for a single pass, then any scheduling/optimization of a single subgraph would require walking through the entire end-to-end model.

@tqchen may clarify. I think the suggestion is to mark the stage, lift it to the graph, and do global flowing instead of removing it (though from the perspective of the subgraph (PrimFunc), it is removed from the PrimFunc).

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163633675
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
Indeed, if a buffer is used in an annotation value, that will change the semantics of the node. However, there are different ways to represent this, as long as it can be reconstructed later. For example, we may introduce an explicit cache stage to add the padding, and mark this block for later processing.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163535262
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
> So long as the constraints can be statically searched for, this approach makes sense to me. I would be more concerned about adding additional semantics to existing nodes, such as a AttrStmt node

It doesn't add additional semantics; the computation semantics stay the same. It is a hint to the graph compiler. Here is an example using `block_attr`: https://github.com/tlc-pack/relax/pull/161/files#diff-0c5223fca97ad1b31a686364a9acc65f59282bb256ba7fd70d9241986828abe5R46-R50

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163447517
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
These make sense, and agreed that the TIR->global feedback is important for enabling the layout reflow.  Going back through the discussion, I think we're converging on agreement on what features are required, and the main question remaining are how to best provide annotation for non-local information, and how best to express layout transformations while scheduling.

I've made some updates to the text of the RFC, based on the discussions here, primarily to remove the proposed changes to TIR data structures.  This follows your comment from a few days ago, which brought up `__builtin_assume` as a comparison.

* Adding an intrinsic `tir::builtin::assume`, which corresponds to the `__builtin_assume` LLVM intrinsic.  The emphasis is that these assumptions are primarily to expose non-local information for use in local simplifications (sketched below the list).
* Removing `BufferConstraint` entirely.  The RFC no longer proposes any changes to TIR data structures, only the `assume` and `undef` intrinsics.
* Describing what assumptions can/should be placed into a PrimFunc when hoisting stages out into independent PrimFuncs, and what transformations are legal based on the choice of exposed assumptions.
* Captured some of the discussion here about the dangers of altering a PrimFunc's interface, and the limited cases where it may be altered.
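
For example, a consumer hoisted out of the running example might carry its assumption roughly as follows (a sketch only; the exact spelling of the intrinsic in TVMScript may differ):

```python
@T.prim_func
def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
    # Non-local information exposed for local simplification: outside the
    # [0, 14) footprint, AC holds the pad value 0.
    for io, ii in T.grid(4, 4):
        T.assume(4 * io + ii < 14 or AC[io, ii] == 0)

    for i in T.serial(14):
        B[i] = 2 * AC[i // 4, i % 4]
```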

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1169188372
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Andrew Reusch <no...@github.com.INVALID>.
we discussed this at the June 6 [community meeting](https://discuss.tvm.apache.org/t/next-tvm-community-meeting-june-8-2022/12900). a significant chunk of the meeting was spent presenting the RFC, and we had about 15 minutes of discussion at the end. 

i think there is more to be discussed here. if we'd like to discuss in high-bandwidth, we can bring this back up at future community meetings. here are notes:

@kparzysz-quic : 
- aside from transform_layout, the immediate application i see from this is vectorization of variable-length loops. we should separate the transformation and optimization parts because those two things are logically independent. the transform_layout will generate TIR, and then that TIR is optimized using a set of other passes/techniques.
  - @Lunderberg agrees. this is the motivation behind splitting this into "transforms" and "more generic operations." HoistExpression does a large part of what is needed for variable-length loop vectorization by splitting out parts that do depend on a dynamic size from the parts that don't. 
- KP is worried that it'll take quite a while to implement enough transforms to get to overcompute (e.g. it's hard to determine whether overcompute can be applied). can we have something that transforms the layout, then allow the user to provide a compute statement that is attested by them to work on the transformed layout without any verification?
  - @Lunderberg i think that on its own (assuming ops are fused together by providing a tensorization that defines "this entire fused operation can be replaced with x followed by y"), can be done by 
  - don't have a good way to express "turn off all additional safeties" but proceed to perform those optimizations.
  - could imagine having something analogous to the `undef` (where that is the "least-convenient value") except as the "most convenient value." if it's most convenient to presume a value is 0, then where this value is present, it's legal to assume that the value is 0 and move forward.
  - there's also a [partway condition](https://github.com/apache/tvm-rfcs/pull/77/files#diff-a5740745158592278e549c62bd8c7ccb5b6317deb56d1164d8bf845ee4db5e75R1919) that doesn't require any of the overcompute proving, but does get to a useful intermediate using only expression hoisting and insertion of existing if/then's that happen for loop rewrites. after everything's been hoisted and simplified, what falls out naturally is an outer loop that splits up into two inner loops:
     - a slow one that handles the edges
     - a fast one that handles the interior
  this might allow us to get to the point of adding the branchless/vectorizable piece even if it's not the only thing there.
- @tqchen notes that one of the reasons we have complexity here is that we are trying to decompose the problem into more general predicates. if we try to go for less complexity, we could introduce transformations that do more transformations at once and thus require less proving.
  - the question remains how we might remove additional unnecessary steps added by the initial layout_transform. on GPUs it might be possible to pad while loading shared memory. in other cases we may need to consult the graph-level model to determine how much padding is needed.
- @Lunderberg notes much of this complexity came from "how can we prove the additional steps are unnecessary?" there are also some additional parts where the constraints written in the copy stage may need to flow upwards from something downstream in the data-dependency graph in order to properly state it. 
  - between explicitly specifying options over N buffers with different pre-existing layouts and identifying whether a layout transformation would require branching loops to handle the edge, a lot of it boils down to which level of abstraction is the layout decided on and how is that exposed to lower levels of abstraction.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1151684472
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by wrongtest <no...@github.com.INVALID>.
Thanks for all the great discussions! It is so exciting that we will have a more powerful ability to handle things like padding and imperfect tiles.

Since our team relies on the code path of s-tir, we are extremely interested in the story on s-tir. I would very much appreciate some details on s-tir padding. I would like to use a [127, 127, 127] matmul to illustrate my questions :)

```python
@T.prim_func
def matmul(A: T.Buffer[(127, 127), "float32"], B: T.Buffer[(127, 127), "float32"], C: T.Buffer[(127, 127), "float32"]):
    for i, j, k in T.grid(127, 127, 127):
        with T.block("compute"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = 0.0
            C[vi, vj] += A[vi, vk] * B[vk, vj]
```

In the current s-tir state, we can construct the padded loop and buffer using existing primitives via the "split and then fuse" trick:
```python
s = tvm.tir.Schedule(matmul)
blk = s.get_block("compute")
i, j, k = s.get_loops(blk)
s.fuse(*s.split(i, factors=[4, 32]))
s.fuse(*s.split(j, factors=[4, 32]))
s.fuse(*s.split(k, factors=[4, 32]))
s.transform_layout(blk, "A", lambda i,k: ((i // 32) * 32 + i % 32, (k // 32) * 32 + k % 32))
s.transform_layout(blk, "B", lambda k,j: ((k // 32) * 32 + k % 32, (j // 32) * 32 + j % 32))
s.transform_layout(blk, "C", lambda i,j: ((i // 32) * 32 + i % 32, (j // 32) * 32 + j % 32))
```
We will get (if simplified)
```python
@T.prim_func
def func(A: T.Buffer[(128, 128), "float32"], B: T.Buffer[(128, 128), "float32"], C: T.Buffer[(128, 128), "float32"]):
    for i_0_i_1_fused, j_0_j_1_fused, k_0_k_1_fused in T.grid(128, 128, 128):
        with T.block("compute"):
            vi = T.axis.spatial(127, i_0_i_1_fused)
            vj = T.axis.spatial(127, j_0_j_1_fused)
            vk = T.axis.reduce(127, k_0_k_1_fused)
            T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)
            T.reads(A[vi, vk], B[vk, vj])
            T.writes(C[vi, vj])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]
```
Then the only thing left is the condition for padding: `T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)`. I believe we now get to the point in the current RFC about the over-computation vs. branching tradeoff. Below are some of my questions ~

1. What happens when we change to `s.transform_layout(..., pad_value=0)`? (if we want over-computation)
   - (possible behavior 1) Insert padding-filling code as a producer block of `compute`.
     - since the effect is immediate, maybe we do not need `BufferConstraint` annotations afterwards?
   - (possible behavior 2) Annotate buffers and let lowering passes handle it.
     - we may require `BufferConstraint` to direct lowering passes.
   - (possible behavior 3) Pass `BufferConstraint` upwards into the graph level,
     - thus assuming the param buffer matches the constraint, and not writing edge values.
   
2. For (1.2)(1.3), it seems encoding the `BufferConstraint` into the buffer object is not the only choice.
    - For s-tir, correct me if I'm wrong, but at least for common cases the constraint could be treated as local w.r.t. the transformed block. What if we encode the constraint just into the block, as part of its memory-access properties?
      We found previously that the block memory annotations `T.reads`, `T.writes` (`BufferRegion`) have some limitations, in that they lose conditional access information. Maybe we can also combine `BufferConstraint` with `BufferRegion`?

    - For graph-level annotations, IIUC, the graph uses "Tensor"-typed values instead of "Buffer" conceptually. Maybe we still need another construction instead of `Buffer` with a `BufferConstraint` field?
      We could also consider instantiating the graph-level transformation explicitly. This is our current solution: https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/4.

    - Nevertheless, if we finally decide to extend the buffer node structure, I hope we can have an explicit lifetime for the `BufferConstraint` in TIR lowering, so that storage-related passes afterwards are not affected, especially customized passes developed by vendors.

3. For the reduce-axis padding mentioned in https://github.com/apache/tvm-rfcs/pull/77#discussion_r894899301:
    - At the TIR level, since a schedule primitive should preserve semantic correctness, how do we prove that the padding on the `k` dimension can only be zero, especially when we do not know that the op is a "matmul" in general? I think this is important if we want to use padded `transform_layout` in auto-scheduling applications.
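
    For this specific matmul, the required condition can be spelled out directly (a sketch of the reasoning, not a general proof): the padded reduction must equal the original one,

    $C[v_i, v_j] = \sum_{k=0}^{126} A[v_i, k] \cdot B[k, v_j] = \sum_{k=0}^{127} \hat{A}[v_i, k] \cdot \hat{B}[k, v_j]$,

    which holds exactly when the extra term $\hat{A}[v_i, 127] \cdot \hat{B}[127, v_j]$ vanishes, i.e. when the padding supplies the additive identity of the reduction. The general question is how to derive this from an arbitrary reduction block.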

cc @Lunderberg @tqchen @vinx13 @Hzfengsy 

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1152928725
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> It doesn't add additional semantic, the computation semantic stays the same, it is a hint to the graph compiler.

My apologies, I had meant the semantics of a node from the perspective of a TIR transformation, not the semantics from the perspective of the computation being described.  For a TIR transformation, if an object is replaced, whatever attributes describe that object must be updated to refer to the new object.  So if constraints are added to the block annotation, I had been thinking of that as a change to the semantics of the `BlockRealizeNode::annotations` from "does not need to be updated when a buffer is replaced" to "must be updated when a buffer is replaced".

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163510231
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
> For this, would the layout re-flowing occur periodically during optimization?

This is a point where different variations of (some sort of) search algorithm will likely be necessary. Our first step would be to allow the TIR level to give such feedback to the global level (via a probabilistic space), so that the search can be done more smartly.

>  The element-wise operations impose a constraint such that input and output layouts, that the input and output have identical layouts.

Agree, this is something that we can do

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1167249134
You are receiving this because you are subscribed to this thread.

Message ID: <ap...@github.com>

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> Introducing changes to TIR would needs some additional thoughts that deserves some extra consideration. Due to the N*M complexity (where N is the TIR possibilities and M is the number of primitives to be supported) that needs to be handled in implementation (by backend implementers and primitive implementers)

This was part of the design consideration, to minimize the impact of the proposed changes to primitives, lowering transformations, and backends.

* The `BufferConstraint` annotations do not need specific handling at the codegen level, as they are only present to enable compile-time optimizations.
  
* Use of the `BufferConstraint` hints would occur within existing utilities, primarily as additional information available in `arith::Analyzer` utilities.  This minimizes the need for other primitives/transforms to be aware of the buffer constraints, while still benefiting from them.
  
* The `T.undef()` built-in does not need specific handling at the codegen level, as it is removed during lowering.
  
* The `T.undef()` built-in does not require specific handling from other primitives, as stores of `T.undef()` can be treated the same as stores of any other value.
  
> Right now it is possible to do non-local constraint rewriting flowings as part of the graph pass. Note that while E1 is indeed less "compact" on one hand, we can use it to reconstruct the desirable compact data structure(something like BufferConstraint that represents the layout mapping) that we can use to flow the decisions across the graph node during the pass.
  
I definitely agree that graph-level transforms are where the layouts and constraints should be decided.  The `BufferConstraint` annotations are not intended as a way to override in TIR what was already decided at the graph level, but rather a way to communicate to TIR transformations what has been decided at the graph level.

> E1: Composing a stage that transforms the layout(a loop that represents the mapping)

I'm still a bit confused by this approach, specifically how one would avoid having a separate compute definition for each workload on a new target (initially brought up by @csullivan [here](https://github.com/apache/tvm-rfcs/pull/77#discussion_r893701372)). In my mind, if I'm going to compose a layout transformation stage, it would need to be followed by a compute stage that takes a transformed layout as input.  So rather than having a single conv2d that can be generalized over layouts, each transformed layout would still need to have a compute stage for it.

> Note that initially such data structures do not need to live beyond the life of a pass, because they can be reconstructed at any time from the other representation.

How would this be represented while optimizing the performance of a subgraph?  My concern would be how to express the non-local constraints while keeping a small search space for optimization.

* Ensure that the producer and consumer stages are within the same subgraph.  The constraints provided to a consumer depend not only on the producer, but also on the constraints provided to the producer, so this might require fusing the entire end-to-end model into a single monolithic kernel.
  
  My understanding is that this would result in a search space that is too large to effectively optimize, though I haven't explicitly tested it.
  
* Insert a transformation stage into the subgraph, in which the constraint is written.  Later portions of the subgraph could then rely on the constraint without examining other subgraphs.
  
  Would need to have some way to indicate that the transformation stage shouldn't be altered during optimization, nor should it be part of the performance timing.
  
* Express the graph-level constraints to a subgraph, so that it can optimize using those constraints.
  
  This was my intent with the `BufferConstraint` annotations, since then the subgraphs could take advantage of the graph-level constraints during their local optimization.
  
> E1 also enables some additional capabilities, e.g. expressing future memory remappings that do not necessarily fit into padding/packing.

Is there an existing annotation to indicate that a stage should be removed entirely during lowering?  That might be an effective way to allow more general usage by annotating a stage that can be assumed to have been performed prior to the subgraph.  This would be a way to express the second option of an extra transformation stage, while still providing enough information to remove the transformation stage during lowering.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
> I'm still a bit confused by this approach, specifically how one would avoid having a separate compute definition for each workload on a new target

Indeed it is important to avoid having a separate compute definition for each workload on a new target. In this particular case, all computation definitions would start with the original layout. Then there is a "schedule transformation", like transform_layout, which will generate the new stage as part of the scheduling process.

The particular stage can be marked; it contains effectively the same information as `BufferConstraint`, except that it does not introduce new data structures. During global layout reflowing, such information can be used to guide the reflowing to reconstruct a data structure like `BufferConstraint` or other layout mappings, and use that to serve the same purpose.

> Is there an existing annotation to indicate that a stage should be removed entirely during lowering? 

Ideally we should not introduce annotation to indicate a stage should be removed, as that breaks the interface of the code itself (ideally the computation should remain the same).

However, we can hint to the compiler that this particular stage is a layout transformation that should be lifted and resolved through the global constraint reflowing. Additionally, such an annotation can be used to guide benchmarking, such that the overall tuning only looks at the non-rewriting part (and we can leverage the transform block to generate input examples correctly).
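
For concreteness, a sketch of what such a hinted stage might look like, reusing the `preproc`/`postproc` block attributes from the examples in this thread (the attribute names are placeholders rather than a settled API):

```python
from tvm.script import tir as T

@T.prim_func
def grow_hinted(A: T.Buffer[14, "int32"], B: T.Buffer[(4, 4), "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")

    # Hinted as a layout-transformation stage: global reflowing may
    # lift it out, and tuning would time only the compute block below.
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            B[io, ii] = 2 * AC[io, ii]
```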


As a high-level summary, the main message is to allow enough info in the TIR (as part of the transform block) such that we can reconstruct a `BufferConstraint`-like auxiliary data structure during global reflowing, while still keeping the TIR part self-contained enough to construct such a data structure.

This also helps in cases where there are other graph-level layout rewritings (e.g. transpose) that can be fused with those additional transformation stages.






Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> In general it is helpful to first keep schedule decisions local, e.g. introducing a caching stage (AC, BC in the example), then compose with another reflowing pass to bring the decision to consumers/producers.

My goal with the latest update wasn't to require global decisions, but to make local changes only, which could be used in different contexts.  For the auto-scheduler, since the context requires maintaining the same PrimFunc interface, local optimization would be restricted to transformations of the caching stage.  For stand-alone usage, such as preparing a single PrimFunc for a unit test, the context allows the interface to change.  That way, the restrictions on the transformations are imposed by the level of abstraction that requires them.

> While it is possible to express padding with a loop and another loop that writes the padded value, it is harder to schedule the resulting blocks as there is more than one producer. Having a single loop and using `T.if_then_else` will express such a pattern in a single shot and makes future rewriting easier.

I definitely agree that this makes the later analysis/rewrites easier. I had maintained them as two separate loops both to minimize the extent of changes being made in any one scheduling change, and to maintain the current behavior of `Schedule.transform_layout` which does not alter the surrounding loop iterators ([previous conversation](https://github.com/apache/tvm/pull/10538#discussion_r826209815) with @vinx13).

I see four main options for how the loop nests could be handled:

1. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions.  It is an error if this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered).

   - Pro: Allows the `T.if_then_else` to be inserted at the time of the transformations.
   - Pro: Removes the need for the follow-up stage.
   - Con: May restrict search space, since earlier manipulation of the loop iterators would prevent later buffer transformations.
   - Con: Doesn't help consumers of a transformed buffer.  In a reduction, may be desirable to iterate over the input buffer, but this couldn't be expressed in terms of an output.
   - Con: For buffers whose padding is not written to, must either insert a conditional statement or maintain the pre-transformation loop structure.

2. When a buffer is transformed, an attempt is made to replace all loops in the producer stage over the buffer's pre-transformation axes with loops over the buffer's post-transformation spatial dimensions.  If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), and if `pad_value` is not `None`, then the follow-up stage is inserted.

   - Pro: Always valid to apply a transform
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Pro: Later calls to `reduce_branching_through_overcompute` could still introduce a value for the padding, if the full life cycle of the buffer is known.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.

3. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions.  If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), then the follow-up stage is inserted.

   - Pro: Always valid to apply a transform
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.

4. When a buffer is transformed, all loops over spatial dimensions in the producer are replaced with loops over the post-transformation buffer axes.

   - Pro: Always valid to apply a transform.
   - Con: May undo scheduling that has previously provided useful performance improvements.
   - Con: Loop iterators over pre-transformation indices may have been fused with reduction axes.  Would need to undo the fusion to apply.
     
The current proposed version would be option 4, but I think I'd prefer option 2 in order to reduce the number of follow-up simplifications required.
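
As a sketch of how option 2 would look from the user's side, assuming the `transform_layout`/`pad_value` interface proposed in this RFC and producer loops that are still in their original form:

```python
import tvm
from tvm import te

# Build a simple 1-d elementwise PrimFunc, then pad its output buffer
# out to a [4, 4] layout.  Because the producer loops over B are
# untouched, option 2 rewrites them in place and no follow-up stage
# (and no later simplification of one) is required.
A = te.placeholder([14], "int32", name="A")
B = te.compute([14], lambda i: A[i] + 1, name="B")

sch = tvm.tir.Schedule(te.create_prim_func([A, B]))
sch.transform_layout(
    block="B",
    buffer="B",
    index_map=lambda i: [i // 4, i % 4],
    pad_value=0,
)
```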

> Some of the complications of duplicated conditions (and their simplification) root from the fact that we do layout transforms of the output and input separately (each introducing its own conditions, which then need to be simplified). It might be helpful to do a global transformation, usually driven from the output, then "backprop" the implication of those decisions to the input. Doing such a transformation in a single shot will likely alleviate the need to generate extra conditions and then simplify them.

At the TIR level, I suppose I'm unclear on what "'backprop' the implication of those decisions to the input" would mean, since changing the layout of one buffer doesn't strictly require changing the layout of other buffers.  Intuitively, I can picture how it would apply to some operators (e.g. perform analogous transformations on the inputs to element-wise functions) and how those could be identified (e.g. track which indices are used for access of each buffer, and identify corresponding shapes from the indices), but I'm unclear as to how a similar intuition would be applied to more complicated functions.  (I'm also not sure if this would require a similarly difficult sequence of proofs as the proposed transforms, just with the goal of proving a preferred layout rather than proving a possible simplification.)

We could allow the user to specify transformations of all buffers simultaneously, but this wouldn't really solve the problem, as the simplifications made would still need to be based on the information provided.

At the graph level, I don't think a single direction of constraint propagation is sufficient.  Backward propagation, starting with the output values returned to the user, could track which indices contribute to that final output, which could be exposed to producers. Forward propagation, starting with the input values provided by the user, could track which indices of intermediate buffers contain known values, which could be exposed to consumers.

With these uncertainties, I'm starting to think of `layout_transform` and `pad_value` not as a complete end-to-end handling in itself, but as providing a platform on which the graph-level reasoning can be built. That is, it doesn't itself perform the graph-level reasoning, but can accept the layout/padding requirements given from graph-level reasoning.



Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Wuwei Lin <no...@github.com.INVALID>.
Thanks @csullivan for providing the overview. I agree that the non-local approaches 2-4 are necessary. From the examples in this RFC I can also see how the components C0-C2 can be used to support these non-local approaches. C0 + C1 allow specifying the constraints during scheduling and propagating them back to the graph. Besides them, I would also like to mention another component:
* C3: ability to specify constraints for each operator.

It seems to me that C0, C1, and C3 are actually implementation choices, as there are multiple ways to combine them to achieve the goal of constraint flowing.
* C0 + C1 (which implies C3 is satisfied) suggests implementing the constraints at the TIR level using `BufferConstraint`. To propagate the constraints back to the graph, which is `Tensor`-centric, the graph-level counterpart of `BufferConstraint` is not clear, as @wrongtest mentioned.
* C3 is also feasible purely in the graph, which requires some mechanism to register per-operator constraints. An example I came up with is that each operator can have a list of supported layouts, and a constraint solver can choose a layout for each operator to approximate the global optimum for the graph. This satisfies the need for non-local approaches but doesn't need TIR-level constraints. Padding, instead of explicitly inserting `transform` / `inv_transform`, is also achievable as graph-level constraint flowing (a toy sketch follows).
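
A toy sketch of what C3's per-operator registration and solver could look like; the registry, names, and greedy strategy here are all hypothetical stand-ins for a real constraint solver:

```python
# Hypothetical per-operator layout registry.
SUPPORTED_LAYOUTS = {
    "conv2d": ["NCHW", "NHWC", "NCHW4c"],
    "add": ["*"],  # elementwise: accepts whatever layout its input uses
}

def choose_layouts(ops, initial="NCHW"):
    """Greedy stand-in for a constraint solver: keep the producer's
    layout whenever the consumer supports it, otherwise relayout."""
    chosen = []
    current = initial
    for op in ops:
        options = SUPPORTED_LAYOUTS.get(op, ["*"])
        if "*" not in options and current not in options:
            current = options[0]  # a layout_transform would be inserted here
        chosen.append((op, current))
    return chosen

print(choose_layouts(["conv2d", "add", "conv2d"]))
# [('conv2d', 'NCHW'), ('add', 'NCHW'), ('conv2d', 'NCHW')]
```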

Back to the discussion of this RFC, I think the main comments about the proposed methods are the IR changes required (which may have greater impacts on the existing TIR and scheduling), and the complexity involved in using the new schedule primitives to reach the final desired state. From my understanding, the intention of these new primitives is to allow arithmetic simplification to perform graph rewriting like over-computation. If this can be achieved as a graph-level rewriting rule (perhaps simpler, as it doesn't need arithmetic manipulation), personally I think that would still be preferred for better maintainability. I'd also like to mention that modeling such rewriting in the graph doesn't necessarily tie the TIR operator to a specific graph IR implementation. As we are moving to S-TIR scheduling, it is easy to apply some preprocessing steps to derive the PrimFunc in a specific layout from a standard `te.compute` definition.

Finally, I would like to encourage us to focus on the e2e goals. It seems the current approaches, whether implemented as A0 or A1 at the graph level, should suffice for the use cases in the inference graph. Though the training graph is probably not an immediate need, if we would like to consider its use cases, having some concrete examples with the desired results can guide us to a better decision.



Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Tianqi Chen <no...@github.com.INVALID>.
cc @Hzfengsy @wrongtest-intellif it would be great if you could also take a follow-up look


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Chris Sullivan <no...@github.com.INVALID>.
Thanks everyone for the very fruitful discussions! We indeed have a good path forward and are aligned on the principle that, for end-to-end optimization, we will maintain function-interface invariance and achieve graph-level layout optimization via a combination of local decisions, reconstruction with assumptions, and rewriting based on the results of graph-level analysis and planning.

I would ask that we move this discussion into a final comment period as we would like to soon open a tracking issue for the items described in the RFC. 


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> Indeed it is important to avoid having a separate compute definition for each workload on a new target. In this particular case, all computation definitions would start with the original layout. Then there is a "schedule transformation", like transform_layout, which will generate the new stage as part of the scheduling process.

Thank you, and that is roughly how I'm seeing it as well: everything starts with the base compute definition and is modified from there.  If I understand correctly, the main differences are below.

* Option A: Layout transformations of inputs are allowed, but only during initial graph-level optimization.  When optimizing an individual PrimFunc, layout transformations of inputs and outputs are not allowed.

* Option B: Layout transformations of inputs and outputs are not allowed.  If this is desired, it should be done by first introducing a cache stage in TIR, then transforming the layout of the cache, and finally by a graph-level transformation that inspects each PrimFunc and hoists the cache stage out.

> The particular stage can be marked; it contains effectively the same information as `BufferConstraint`, except that it does not introduce new data structures. During global layout reflowing, such information can be used to guide the reflowing to reconstruct a data structure like `BufferConstraint` or other layout mappings, and use that to serve the same purpose.

So long as the constraints can be statically searched for, this approach makes sense to me.  I would be more concerned about adding additional semantics to existing nodes, such as an `AttrStmt` node, since it then requires passes to be aware not only of the existence of the constraint, but also that it must be reconstructed from the existing data structure.  This approach would make it much more difficult for a static analysis tool to identify locations where the constraints must be updated.

As a way to potentially find a way forward, what if we start by implementing pad values only for buffers that are allocated internally to a function?  This would be allowed behavior under both Option A and Option B, and would help determine how difficult reconstruction of the constraints would be from the transformation block without any additional annotation.  This could help motivate whether additional annotations are necessary, regardless of whether they are stored alongside the Buffer itself or in a separate attribute/annotation.
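
Concretely, the proposed starting point would cover cases like the following sketch, where the transformed buffer is an internal cache stage rather than a function argument (written against the `transform_layout`/`pad_value` interface proposed here):

```python
import tvm
from tvm import te

A = te.placeholder([14], "int32", name="A")
AC = te.compute([14], lambda i: A[i], name="AC")  # internal cache stage
B = te.compute([14], lambda i: 2 * AC[i], name="B")

sch = tvm.tir.Schedule(te.create_prim_func([A, B]))
# AC is allocated inside the PrimFunc, so padding it out to [4, 4]
# with a known value changes no function signature, and is therefore
# allowed under both Option A and Option B.
sch.transform_layout(
    block="AC",
    buffer="AC",
    index_map=lambda i: [i // 4, i % 4],
    pad_value=0,
)
```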


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

Posted by Eric Lunderberg <no...@github.com.INVALID>.
> Talking about "constraints", it is also useful to talk about categories of them; roughly, we can divide them into three categories.

I like this breakdown, and agree.  In this categorization, what I've been calling "constraints" would be "assumptions".  Double-checking in `builtin.h`, it looks like we don't currently have a TIR equivalent of `__builtin_assume`.

For usage of assumptions, I think the key would be to insert an assumption whenever the information that could otherwise prove it is hoisted out of the PrimFunc.  That would provide non-local information that could be used by the PrimFunc to allow local simplifications.
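
For illustration, the inserted assumption might look like the sketch below, where `T.assume` is a hypothetical TIR analogue of `__builtin_assume` rather than an existing builtin:

```python
from tvm.script import tir as T

# After graph-level reflowing hoists the padding stage out of this
# PrimFunc, the fact that the padding of AC holds zeros can no longer
# be proven locally.  The hypothetical T.assume restores that
# information so that local simplifications remain valid.
@T.prim_func
def addone_packed(AC: T.Buffer[(4, 4), "int32"], BC: T.Buffer[(4, 4), "int32"]):
    for io, ii in T.grid(4, 4):
        T.evaluate(T.assume(4 * io + ii < 14 or AC[io, ii] == 0))
    for io, ii in T.grid(4, 4):
        BC[io, ii] = AC[io, ii] + 1
```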

> transformations of a PrimFunc do not change the PrimFunc interface: this is really important so we can transform a PrimFunc without worrying about how the graph interacts with it (as the interface remains the same, we can lift out the blocks earlier)

I don't think we can make this strong of a statement, as it would also forbid fusing operators together or hoisting a stage out of a PrimFunc.  In both cases, the signature of the resulting PrimFunc may be different than it was before.  This shows up in the example, as the interface of `grow` is different from the transformed `grow_packed`.

As a slightly less general statement, I would say that transformations of a PrimFunc *in isolation* may not change the PrimFunc's interface. So an optimization search to improve the performance of a single subgraph may not change the layout of its own arguments, nor may it change assumptions of what is present in the padding, as those would change its interface.  However, a graph-level transform would be allowed to fuse subgraphs, to hoist stages out of a PrimFunc, to alter the layout of a PrimFunc's input, or to alter the assumptions provided about the inputs.  In general, a PrimFunc's interface could only be changed when calls into the PrimFunc are also modified to remain compatible.

Is there a better term than "scheduling primitive" to describe layout transformations that impact input/output buffers?  I think the difference is between context-independent transformations that may be performed on a PrimFunc without changing its callers, as opposed to context-dependent transformations that may only be performed as part of a graph-level transformation.



> Each function needs to have its own TIR analysis of how it flows things back. For example, in the case of `addone`, we can safely flow PadMapping back, changing `addone` to `addone_packed` by analyzing the TIR. If the addone is an elementwise exp, however, we need to insert a select operator (because `exp(0)=1`), and the message to the input becomes `PadMapping(constraint, pad_value=undef)`.

Would this handle cases where there are multiple different options for how an operator could be implemented?  Otherwise, I'm not sure how this would handle cases where multiple different sets of layouts/constraints could be inferred from different TIR-level schedules of the same operator.  As an example, the drop-down below has six different implementations of `addone`, each of which would allow different hoistable pad/crop operations.

<details>
<summary>Click to expand</summary>
<br>

```python
# Implementation 1, no preproc/postproc are present.
#
# No hoistable layout transformations.  Could be fused with a layout
# transformation, but doesn't otherwise provide any constraints.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        with T.block("compute"):
            B[i] = A[i] + 1


# Implementation 2, pad input/output, but never access the padding of
# either input or output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  However, AC's padding is never written to, so could
# propagate T.undef() back to preceding function.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Intermediate padded caches
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            if 4 * io + ii < 14:
                AC[io, ii] = A[4 * io + ii]

    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 3, pad input with known value, but never access
# padding of output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Intermediate padded caches
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 4, pad input with arbitrary value, provide no
# guarantees in output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=BC_pad_value - 1)` to the
# previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Intermediate padded caches
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], T.undef())

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 5, pad input with known value, analysis of TIR
# successfully propagates pad value through to provide assumption when
# cropping.
#
# In back-propagation of constraints, the output assumption is fixed.
# Unless the operator following addone has included the constraint 1
# as the required value in its padding, the crop/pad pair wouldn't be
# able to be removed.  AC's padding is written to, and would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Intermediate padded caches
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", 1])
            B[i] = BC[i // 4, i % 4]


# Implementation 6, pad input with known value, analysis of TIR can't
# successfully propagate pad value through to the output.
#
# In back-propagation of constraints, the output assumption is fixed.
# Since we don't provide an assumption of what will be returned, the
# graph-level pair of `crop(T.undef())` followed by `pad(x)` could
# only be canceled out if `x` is `T.undef()`.  AC's padding is written
# to, and would propagate `PadMapping(predicate, pad_value=0)` to
# the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Intermediate padded caches
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]
```

</details>

I think the main change is that the temporary stages with annotations will need to allow multiple possibilities, rather than a single definitive layout.  These options could then be searched at the graph level to decide on the appropriate layout.  After that is decided, the temporary stage could be selected and the transformations hoisted.


> But an extra amount of care is needed when we attempt to move `crop_with_pad_assume`, as it really depends on the value property of its input.

Completely agreed.  I think this is true at both the TIR and graph levels, that allowing assumptions means ensuring that the assumption isn't changed after it is used for simplifications.  The advantage of writing the assumptions at the graph level is that specific pairs of functions (such as `crop_with_pad_assume(pad_value)` followed by `pad_with_value(pad_value)`) can be identified as no-ops, without needing a full proof of it.
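
A toy sketch of that graph-level cancellation, using the stage names from the discussion above (the data structure and rewrite hook are hypothetical):

```python
from collections import namedtuple

# Minimal stand-in for a graph-level stage annotation.
Stage = namedtuple("Stage", ["op", "pad_value"])

def try_cancel(first, second):
    """Remove a crop_with_pad_assume/pad_with_value pair only when the
    assumed and re-written pad values match; otherwise the assumption
    may already have justified irreversible simplifications."""
    if (
        first.op == "crop_with_pad_assume"
        and second.op == "pad_with_value"
        and first.pad_value == second.pad_value
    ):
        return []  # the pair is a no-op and both stages are removed
    return [first, second]  # keep both stages

print(try_cancel(Stage("crop_with_pad_assume", 0), Stage("pad_with_value", 0)))
```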

I think the main rules that would need to be followed when handling assumptions would be the following three.

1. An assumption may be inserted wherever it can be statically proven, or asserted by a user about user-supplied input.
   
2. An assumption may be removed only if it can be statically proven. Assertions from a user about user-supplied input may never be removed, as they may have already been used to perform irreversible simplifications.
   
3. Static provers must reset all assumptions about a variable when `T.undef()` is assigned to it, even though these assignments are removed during lowering.

The restriction against changing a PrimFunc's interface falls out directly from rule #1.  Since an assumption that restricts the values of an input cannot be statically proven, these assumptions may not be modified.
